5 tips for polishing your Machine Learning dataset 🧼

I have been working professionally as a Machine Learning Engineer for more than two years now, and I recently co-authored a book titled “Sculpting Data for ML: The first act of Machine Learning”. My experience so far has taught me that data does not get its due limelight in #MachineLearning compared to complex model architectures. In keeping with 'more data beats clever algorithms, but better data beats more data', here are my top 5 tips for polishing a dataset to solve #ML problems effectively 🤖👇🏻

DATA IS SEASONAL ☃️🏝🌧 Every data point has an expiration date! With endless data streams today, it is important to run data distribution checks continuously so that the ‘data is IID’ assumption keeps holding, more so if you are training on-the-go.
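
One way to run such a distribution check is to compare a reference window of a feature against the latest window. Here is a minimal sketch using the two-sample Kolmogorov-Smirnov statistic, written with the standard library only. The function names and the 0.2 threshold are my own assumptions for illustration, not a recommended setting.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    # The largest gap between two step functions always sits at an observed value.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def has_drifted(reference, current, threshold=0.2):
    """Flag drift when the gap exceeds a threshold (0.2 is an arbitrary assumption)."""
    return ks_statistic(reference, current) > threshold


# Hypothetical feature values: the data the model was trained on vs. fresh data.
training_window = [1, 2, 3, 4]
latest_window = [3, 4, 5, 6]
print(ks_statistic(training_window, latest_window))
print(has_drifted(training_window, latest_window))
```

In practice you would pick the threshold (or a proper significance test) per feature and rerun the check on a schedule.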

FIGHT THE BIAS 🤺🧯💥 While data might represent the ultimate truth, the way a dataset is created might not. Any form of #DatasetBias can limit the generalization capabilities of even the most sophisticated #ML algorithm and thus unintentionally lead to collective, disparate impact.
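
A quick sanity check for this is to compare how often the positive label shows up in each group of the dataset. The sketch below assumes rows are plain dicts; the `group` and `approved` keys and the function names are hypothetical. It is one simple signal, not a full bias audit.

```python
from collections import defaultdict


def positive_rate_by_group(rows, group_key, label_key):
    """Share of rows with a truthy label, computed separately for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += 1
        if row[label_key]:
            positives[row[group_key]] += 1
    return {group: positives[group] / totals[group] for group in totals}


def disparate_impact_ratio(rates):
    """Lowest group rate divided by the highest group rate (1.0 = perfectly even)."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest else 1.0


# Hypothetical labelled rows.
rows = [
    {"group": "A", "approved": 1}, {"group": "A", "approved": 1},
    {"group": "A", "approved": 0}, {"group": "A", "approved": 0},
    {"group": "B", "approved": 1}, {"group": "B", "approved": 0},
    {"group": "B", "approved": 0}, {"group": "B", "approved": 0},
]
rates = positive_rate_by_group(rows, "group", "approved")
print(rates, disparate_impact_ratio(rates))
```

A ratio well below 1 does not prove the dataset is biased, but it is a good prompt to revisit how the data was collected and labelled.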

MIX IT UP 🍹🍸🥃 Many times, simply not having enough data becomes a blocker. In such cases, it helps to identify related datasets and combine them using horizontal/vertical integration. If that is not possible, data augmentation and oversampling techniques may come in handy.
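
Here is a minimal sketch of these ideas on lists of dicts. I am reading 'vertical' integration as stacking rows from datasets with the same columns and 'horizontal' integration as joining columns for the same entities, which is an assumption on my part. The helper names and the random oversampling strategy are illustrative too.

```python
import random


def stack_rows(first, second):
    """Vertical integration: same columns, more rows."""
    return first + second


def join_on_key(left, right, key):
    """Horizontal integration: same entities, more columns (inner join on `key`)."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


def oversample(rows, label_key, seed=0):
    """Randomly duplicate rows of smaller classes until each matches the largest."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing rows with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Hypothetical data: three 'ham' rows and one 'spam' row.
emails = [{"id": i, "label": "ham"} for i in range(3)] + [{"id": 3, "label": "spam"}]
balanced = oversample(emails, "label")
print(len(balanced))
```

Random oversampling only repeats what you already have, so keep the duplicates out of your validation split.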

PROBLEM OF PLENTY ✂️👎🏻🎬 More data definitely helps, but sometimes a huge dataset takes a toll on training time and compute resources. Techniques like importance sampling or stratified sampling help cut down on data points while still maintaining model performance.
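
As a rough illustration, stratified sampling keeps the same fraction of every class so that the smaller dataset preserves the original label mix. The function name and dict-based rows below are assumptions for the sketch.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_key, fraction, seed=0):
    """Keep roughly `fraction` of the rows from every class, without replacement."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[label_key]].append(row)
    sample = []
    for group in strata.values():
        # Keep at least one row so that rare classes never vanish entirely.
        keep = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, keep))
    return sample


# Hypothetical dataset: 80 cats and 20 dogs, cut down to a quarter of its size.
pets = [{"label": "cat"}] * 80 + [{"label": "dog"}] * 20
subset = stratified_sample(pets, "label", fraction=0.25)
print(len(subset))
```

Whether model performance actually holds on the smaller set is something to verify on a held-out split rather than assume.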

DATA ABOUT THE DATA 🤓🔬📈 Any dataset we use must have enough metadata. A sufficient amount of side information helps in engineering more and better features so that our #ML model can decipher patterns more effectively.
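
For example, a capture timestamp stored alongside each record can be turned into several features. In this small sketch the `captured_at` and `source` fields are hypothetical metadata, and the derived features are just one possible choice.

```python
from datetime import datetime


def features_from_metadata(record):
    """Derive simple model features from a record's (hypothetical) metadata fields."""
    captured = datetime.fromisoformat(record["captured_at"])
    return {
        "hour": captured.hour,
        "day_of_week": captured.weekday(),      # Monday is 0, Sunday is 6
        "is_weekend": captured.weekday() >= 5,
        "source": record.get("source", "unknown"),
    }


record = {"captured_at": "2021-07-18T14:30:00", "source": "mobile_app"}
print(features_from_metadata(record))
```

Without the timestamp and source stored next to the raw data, none of these features could be recovered later.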

PRO TIP ✨📖👩🏻‍💻 Lastly, you are in no way limited to what public dataset repositories have to offer. The sky's the limit. Grab a copy of @DataForML today and learn how to create your very own quality dataset using #Python and other #OpenSource tools!



Written on July 18, 2021