Data Cascades in Machine Learning ⛲️
Machine Learning models are only as good as the data they consume 🍴 Data impacts the performance, fairness, robustness & scalability of #ML systems. If data is not taken care of, it leads to a TON of tech debt over time in a corporate setting, and the downstream effects of that neglect are termed DATA CASCADES 🌊 🧵👇🏻
Thread by Sculpting Data: First Act of Machine Learning 📖 (@DataForML), July 28, 2021
TROUBLEMAKER 😈🙅🏻‍♀️😼 The issues often originate early in an #ML system's life cycle, for example at the stage where we collect or annotate the data. These seemingly small issues grow into complex challenges that affect #MachineLearning model development and deployment.
DETECT ‘EM 🕵🏻‍♀️👀🤓 Diagnosing #DataCascades is tough, especially since there are no clear indicators or tools to detect them, and no well-defined metrics to measure their long-term effects. We can and should prioritize data in an #ML system alongside model development.
DATA FIRST 🎖🎁🏆 Efforts towards bringing empiricism to data should be incentivized in the organization (and in the #MachineLearning community as a whole). This means recognizing and rewarding work on dataset collection, labeling, cleaning, and maintenance as much as any modeling work.
MONITOR YOUR DATA 👀📈🔬 Similar to developing good model performance metrics, we should develop metrics to measure health & goodness of data too! Metrics related to data distribution, feature coverage, data freshness, etc. can be used to measure/identify #DataCascades early on.
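Here is a minimal #Python sketch of what such data health metrics could look like, using only the standard library. Everything in it is an illustrative assumption rather than something from the thread: the function names (`feature_coverage`, `freshness_days`, `population_stability_index`), the `collected_at` field, and the toy rows are all hypothetical. The Population Stability Index (PSI) is just one common way to put a number on a shift in data distribution; the thread does not prescribe a specific measure.

```python
import math
from datetime import datetime, timezone


def feature_coverage(rows, feature):
    """Fraction of rows where `feature` is present and not None."""
    if not rows:
        return 0.0
    present = sum(1 for row in rows if row.get(feature) is not None)
    return present / len(rows)


def freshness_days(rows, now, timestamp_key="collected_at"):
    """Age in days of the most recently collected row (hypothetical field name)."""
    newest = max(row[timestamp_key] for row in rows)
    return (now - newest).total_seconds() / 86400.0


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """PSI between two numeric samples, binned over the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


# Toy data, made up purely for illustration.
rows = [
    {"age": 34, "collected_at": datetime(2021, 7, 1, tzinfo=timezone.utc)},
    {"age": None, "collected_at": datetime(2021, 7, 20, tzinfo=timezone.utc)},
    {"age": 51, "collected_at": datetime(2021, 7, 27, tzinfo=timezone.utc)},
    {"age": 29, "collected_at": datetime(2021, 7, 10, tzinfo=timezone.utc)},
]
now = datetime(2021, 7, 28, tzinfo=timezone.utc)

coverage = feature_coverage(rows, "age")   # 3 of 4 rows have an age: 0.75
freshness = freshness_days(rows, now)      # newest row is one day old: 1.0

reference = list(range(100))               # e.g. values seen at training time
shifted = [v + 30 for v in reference]      # e.g. values seen in production
psi_same = population_stability_index(reference, reference)   # 0.0, no drift
psi_shift = population_stability_index(reference, shifted)    # clearly above 0
```

Tracked over time, numbers like these could be put on a dashboard or checked against alert thresholds. What counts as "too stale" or "too much drift" depends on the dataset, so any threshold would need to be chosen per project.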
SCULPT IT RIGHT 🧱⏳📖 To avoid #DataCascading, get started right by fostering data literacy for #MachineLearning. Read through @DataForML and learn how to create your very own quality dataset using #Python and other #OpenSource tools!
WANNA KNOW MORE ❓📝🤓 Read more about the concept in this blog post: ai.googleblog.com/data-cascades-in-machine-learning