Data Cascades in Machine Learning
Machine Learning models are only as good as the data they consume. Data impacts the performance, fairness, robustness & scalability of #ML systems. If not taken care of, data issues pile up into a TON of tech debt over time in a corporate setting, and the downstream effects are termed DATA CASCADES.
Sculpting Data: First Act of Machine Learning (@DataForML), July 28, 2021
TROUBLEMAKER: Often the issues originate early in an #ML system's life cycle, for example at the stage where we collect or annotate the data. These seemingly small issues grow into complex challenges that affect #MachineLearning model development and deployment.
DETECT 'EM: Diagnosing #DataCascades is tough, especially since there are no clear indicators or tools to detect them, nor well-defined metrics to measure their long-term effects. We can and should prioritize data in an #ML system alongside model development.
DATA FIRST: Efforts towards bringing empiricism to data should be incentivized in the organization (and in the #MachineLearning community as a whole). This means recognizing and rewarding work on dataset collection, labeling, cleaning, and maintenance as much as any modeling work.
MONITOR YOUR DATA: Just as we develop good model performance metrics, we should develop metrics to measure the health & goodness of data too! Metrics related to data distribution, feature coverage, data freshness, etc. can be used to identify and measure #DataCascades early on.
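To make the three metrics named above concrete, here is a minimal #Python sketch using only the standard library. Everything in it is an assumption for illustration: the function names (`feature_coverage`, `freshness_days`, `label_shift`), the record layout (a list of dicts with a `collected_at` timestamp), and the choice of total variation distance as the distribution metric are hypothetical, not something prescribed by the thread or the linked blog post.

```python
from collections import Counter
from datetime import datetime


def feature_coverage(rows, feature):
    """Fraction of rows where `feature` is present and not None (hypothetical helper)."""
    if not rows:
        raise ValueError("rows must not be empty")
    present = sum(1 for row in rows if row.get(feature) is not None)
    return present / len(rows)


def freshness_days(rows, timestamp_key, now):
    """Age, in days, of the most recently collected row (hypothetical helper)."""
    if not rows:
        raise ValueError("rows must not be empty")
    newest = max(row[timestamp_key] for row in rows)
    return (now - newest).total_seconds() / 86400


def label_shift(reference_labels, current_labels):
    """Total variation distance between two label distributions.

    0.0 means identical proportions, 1.0 means no overlap at all.
    """
    if not reference_labels or not current_labels:
        raise ValueError("both label lists must be non-empty")
    ref, cur = Counter(reference_labels), Counter(current_labels)
    n_ref, n_cur = len(reference_labels), len(current_labels)
    # Counter returns 0 for labels seen in only one of the two sets.
    return 0.5 * sum(
        abs(ref[label] / n_ref - cur[label] / n_cur)
        for label in set(ref) | set(cur)
    )


# Illustrative toy data, not taken from any real dataset.
now = datetime(2021, 7, 28)
rows = [
    {"age": 34, "label": "cat", "collected_at": datetime(2021, 7, 1)},
    {"age": None, "label": "dog", "collected_at": datetime(2021, 7, 18)},
    {"age": 51, "label": "cat", "collected_at": datetime(2021, 6, 15)},
    {"age": 29, "label": "cat", "collected_at": datetime(2021, 7, 10)},
]
reference = ["cat", "cat", "dog", "dog"]  # e.g. labels from the training snapshot

coverage = feature_coverage(rows, "age")
age_days = freshness_days(rows, "collected_at", now)
shift = label_shift(reference, [row["label"] for row in rows])
```

Tracking numbers like these over time, and alerting when they cross a threshold your team agrees on, is one possible way to surface a cascade before it reaches the model. The thresholds themselves depend on the dataset and are deliberately left out here.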
SCULPT IT RIGHT: To avoid #DataCascades, get started right by fostering data literacy for #MachineLearning. Read through @DataForML and learn how to create your very own quality dataset using #Python and other #OpenSource tools!
WANNA KNOW MORE? Read more about the concept in this blog post: ai.googleblog.com/data-cascades-in-machine-learning