Data Cascades in Machine Learning ⛲️

Machine Learning models are only as good as the data they consume 🍴 Data impacts the performance, fairness, robustness & scalability of #ML systems. If neglected, data issues pile up into a TON of tech debt over time in a corporate setting, and their downstream effects are termed DATA CASCADES 🌊 🧵👇🏻

— Sculpting Data: First Act of Machine Learning 📖 (@DataForML) July 28, 2021

TROUBLEMAKER 😈🙅🏻‍♀️😼 The issues often originate early in an #ML system's life cycle, for example at the stage where we collect or annotate the data. These seemingly small issues grow into complex challenges that affect #MachineLearning model development and deployment.

DETECT ‘EM 🕵🏻‍♀️👀🤓 Diagnosing #DataCascades is tough, especially since there are no clear indicators or tools to detect them, and no well-defined metrics to measure their long-term effects. We can and should prioritize data in an #ML system alongside model development.

DATA FIRST 🎖🎁🏆 Efforts to bring empiricism to data should be incentivized in the organization (and in the #MachineLearning community as a whole). This means recognizing and rewarding work on dataset collection, labeling, cleaning, or maintenance as much as any modeling work.

MONITOR YOUR DATA 👀📈🔬 Similar to developing good model performance metrics, we should develop metrics to measure health & goodness of data too! Metrics related to data distribution, feature coverage, data freshness, etc. can be used to measure/identify #DataCascades early on.
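As a rough illustration of the metrics mentioned above, here is a minimal sketch in plain #Python. All function names are hypothetical and not taken from any library or from the referenced blog post. The Population Stability Index (PSI) is just one common way to quantify a shift in data distribution, picked here as an assumption. Any alert thresholds would depend on your own data.

```python
import math
from datetime import datetime, timedelta


def feature_coverage(rows, feature):
    """Fraction of rows where `feature` is present and not None."""
    if not rows:
        return 0.0
    present = sum(1 for row in rows if row.get(feature) is not None)
    return present / len(rows)


def freshness_days(timestamps, now):
    """Age, in days, of the most recent record."""
    return (now - max(timestamps)).total_seconds() / 86400


def population_stability_index(reference, current, bins=10):
    """PSI between two numeric samples, binned over the reference range.

    0 means identical binned distributions; larger values mean more drift.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range are clamped to the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Example usage with made-up data (assumed schema: dicts with an "age" field).
rows = [{"age": 31}, {"age": None}, {}, {"age": 45}]
now = datetime(2021, 7, 28)
stamps = [now - timedelta(days=3), now - timedelta(days=1)]
train_ages = list(range(100))
live_ages = [a + 50 for a in train_ages]

coverage = feature_coverage(rows, "age")
staleness = freshness_days(stamps, now)
drift = population_stability_index(train_ages, live_ages)
```

Tracking numbers like `coverage`, `staleness`, and `drift` per dataset version, next to the usual model metrics, is one way to surface a problem before it cascades downstream.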

SCULPT IT RIGHT 🧱⏳📖 To avoid #DataCascading, get started right by fostering data literacy for #MachineLearning. Read through @DataForML and learn how to create your very own quality dataset using #Python and other #OpenSource tools!

WANNA KNOW MORE ❓📝🤓 Read more about the concept in this blog post: ai.googleblog.com/data-cascades-in-machine-learning

Written on July 28, 2021

Jigyasa Grover

Generative AI & Research Lead 👩🏻‍💻

'Sculpting Data for ML' Book Author 📖

10x Award Winner in AI & Open Source 🏆

Google Dev Expert & Featured in Google I/O 🎬
