Data Cascades in Machine Learning ⛲️

Machine Learning models are only as good as the data they consume 🍴 Data impacts the performance, fairness, robustness & scalability of #ML systems. If neglected, data issues build up a TON of tech debt over time in a corporate setting, and their downstream effects are termed DATA CASCADES 🌊 🧵👇🏻



TROUBLEMAKER 😈🙅🏻‍♀️😼 The issues often originate early in an #ML system’s life cycle, for example at the stage where we collect or annotate the data. These seemingly small issues grow into complex challenges that affect #MachineLearning model development and deployment.

DETECT ’EM 🕵🏻‍♀️👀🤓 Diagnosing #DataCascades is tough, especially since there are no clear indicators or tools to detect them and no well-defined metrics to measure their long-term effects. We can and should prioritize data in an #ML system alongside model development.

DATA FIRST 🎖🎁🏆 Efforts to bring empiricism to data should be incentivized in the organization (and in the #MachineLearning community as a whole). This means recognizing and rewarding work on dataset collection, labeling, cleaning, and maintenance as much as any modeling work.

MONITOR YOUR DATA 👀📈🔬 Just as we develop good model performance metrics, we should develop metrics to measure the health & quality of data too! Metrics related to data distribution, feature coverage, data freshness, etc. can be used to identify and measure #DataCascades early on.
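
Here is a minimal sketch of what such data-health metrics could look like in #Python, using only the standard library. The function names (feature_coverage, freshness_days, psi) and the sample records are hypothetical, made up for illustration. The Population Stability Index is just one of several ways to compare two distributions, and nothing here is prescribed by the original blog post.

```python
import math
from datetime import datetime, timezone


def feature_coverage(rows, feature):
    """Fraction of rows where `feature` is present and not None."""
    if not rows:
        return 0.0
    present = sum(1 for row in rows if row.get(feature) is not None)
    return present / len(rows)


def freshness_days(timestamps, now):
    """Age in days of the most recent record."""
    return (now - max(timestamps)).total_seconds() / 86400


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Equal-width bins are taken from the reference sample's range.
    0 means identical binned distributions; larger means more drift.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant data

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical sample records: two of the four rows carry an "age" value.
rows = [{"age": 31}, {"age": None}, {"age": 45}, {}]
print(feature_coverage(rows, "age"))  # 0.5

# Hypothetical timestamps: the newest record is one day old.
now = datetime(2021, 7, 28, tzinfo=timezone.utc)
seen = [datetime(2021, 7, 26, tzinfo=timezone.utc),
        datetime(2021, 7, 27, tzinfo=timezone.utc)]
print(freshness_days(seen, now))  # 1.0

# Compare training-time values against shifted "production" values.
train = list(range(100))
prod = [v + 50 for v in train]
print(psi(train, train))  # 0.0
print(psi(train, prod) > 0)  # True
```

Tracked over time, numbers like these give an early signal: coverage dropping, the newest record getting older, or drift climbing are all worth a look before the model is retrained. What counts as "too low" or "too much drift" is a call each team has to make for its own data.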

SCULPT IT RIGHT 🧱⏳📖 To avoid #DataCascading, get started right by fostering data literacy for #MachineLearning. Read through @DataForML and learn how to create your very own quality dataset using #Python and other #OpenSource tools!



WANNA KNOW MORE ❓📝🤓 Read more about the concept in this blog post: ai.googleblog.com/data-cascades-in-machine-learning




Written on July 28, 2021