Demystifying Data Quality in Machine Learning
DATA is the new oil. As #DataCentric approaches to #ML gain traction, access to diverse, comprehensive, and, more importantly, quality data has been the talk of the town. Along these lines, it's important to understand what QUALITY really means in the context of DATA.
— Sculpting Data: First Act of Machine Learning (@DataForML) September 10, 2021
DATA QUALITY refers to the qualitative or quantitative state of the information we possess. Factors like accuracy, completeness, consistency, reliability, and whether the data is up to date help us measure it. Check out this cool illustration by @GradientFlowR!
However sophisticated our #MachineLearning models might be, poor-quality data can never help us solve the problem at hand effectively. Here are some pointers for keeping a quality check on our data going forward.
The first step is to understand the data and how it serves our use case, a concept called DATA PROFILING. It involves reviewing the source, understanding the structure, and summarizing data volume, side information, and other statistics. This step helps uncover quality issues from the get-go.
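To make profiling concrete, here is a minimal sketch using only the Python standard library. It assumes the dataset is a list of dictionaries with hashable values; the `profile` function and the sample records are hypothetical and exist only for illustration.

```python
from collections import Counter


def profile(rows):
    """Summarize each column of a list-of-dicts dataset (illustrative helper)."""
    columns = sorted({key for row in rows for key in row})
    report = {"row_count": len(rows), "columns": {}}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and empty strings as missing values.
        present = [v for v in values if v is not None and v != ""]
        # Only real numbers take part in min/max (bool is excluded on purpose).
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        report["columns"][col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),  # assumes hashable values
            "types": dict(Counter(type(v).__name__ for v in present)),
            "min": min(numeric) if numeric else None,
            "max": max(numeric) if numeric else None,
        }
    return report


# Made-up sample records: one missing age and one age stored as text.
sample = [
    {"id": 1, "age": 34, "city": "Pune"},
    {"id": 2, "age": None, "city": "Delhi"},
    {"id": 3, "age": "41", "city": "Pune"},
]
summary = profile(sample)
```

Even on three rows, the summary surfaces two typical issues: the `age` column has a missing value, and it mixes `int` and `str` types.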
Data changes over time, so it's crucial to develop DATA HEALTH MEASUREMENTS that signal when quality degrades. Metrics built along the data quality dimensions mentioned previously can be very helpful for continuous health monitoring, and we can tailor them to our use cases.
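As a rough sketch of what such measurements could look like, the snippet below computes two metrics, completeness and freshness, and flags any metric that falls below a threshold. The function names, sample records, and threshold values are assumptions chosen for illustration; real thresholds depend on the use case.

```python
from datetime import datetime, timedelta


def completeness(rows, column):
    """Share of rows where `column` is filled in (not None or empty)."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get(column) not in (None, ""))
    return filled / len(rows)


def freshness(rows, column, now, max_age):
    """Share of rows whose timestamp is at most `max_age` old.

    Rows without a timestamp count as stale.
    """
    if not rows:
        return 0.0
    fresh = sum(
        1 for r in rows
        if r.get(column) is not None and now - r[column] <= max_age
    )
    return fresh / len(rows)


def health_alerts(metrics, thresholds):
    """Return the names of metrics that fall below their threshold."""
    return [
        name for name, value in metrics.items()
        if value < thresholds.get(name, 0.0)
    ]


# Made-up records and thresholds, for illustration only.
now = datetime(2021, 9, 10)
rows = [
    {"email": "a@x.com", "updated_at": datetime(2021, 9, 9)},
    {"email": "", "updated_at": datetime(2021, 9, 1)},
    {"email": "c@x.com", "updated_at": datetime(2021, 6, 1)},
    {"email": "d@x.com", "updated_at": datetime(2021, 9, 5)},
]
metrics = {
    "completeness": completeness(rows, "email"),
    "freshness": freshness(rows, "updated_at", now, timedelta(days=30)),
}
alerts = health_alerts(metrics, {"completeness": 0.9, "freshness": 0.7})
```

Running checks like these on a schedule and tracking the values over time is one simple way to notice degradation early.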
Despite all checks, quality issues might still creep in. This is where DATA REPAIR comes into play. We should invest in tools that help with error diagnosis and automatically apply easy fixes like deduplication and imputation, while seeking manual help for advanced resolutions.
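The two easy fixes named above, deduplication and imputation, can be sketched in a few lines of standard-library Python. This is a simplified illustration, not a production repair tool: the helper names are hypothetical, duplicates are detected by exact match on key fields, and median imputation is just one of several possible strategies.

```python
from statistics import median


def deduplicate(rows, key_fields):
    """Keep the first row for each combination of `key_fields`."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute_median(rows, column):
    """Return copies of rows with missing `column` values set to the median."""
    observed = [r[column] for r in rows if r.get(column) is not None]
    if not observed:
        # Nothing to learn a fill value from; return unchanged copies.
        return [dict(r) for r in rows]
    fill = median(observed)
    return [
        {**r, column: fill if r.get(column) is None else r[column]}
        for r in rows
    ]


# Made-up records: id 1 appears twice and one age is missing.
raw = [
    {"id": 1, "age": 30},
    {"id": 2, "age": None},
    {"id": 1, "age": 30},
    {"id": 3, "age": 50},
]
deduped = deduplicate(raw, ["id"])
repaired = impute_median(deduped, "age")
```

Both helpers return new lists and leave the input untouched, which makes it easier to compare the data before and after repair and to escalate anything suspicious for manual review.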
If you are interested in reading more about QUALITY DATA and how to build your own dataset from scratch, grab a copy of Sculpting DataForML today and learn how to sculpt data the right way using #Python and other #OpenSource tools!
We found the following article very insightful. It discusses the concept of DATA QUALITY and ways to combat data quality problems. Do read it if you get a chance! gradientflow.com/data-quality-unpacked