Data Cleaning is one of the most important parts of Data Science and Machine Learning. A well-optimized Machine Learning model needs good data, and there are no shortcuts or tricks for cleaning it well. Data Scientists and Machine Learning Engineers spend much of their time on Data Cleaning because the output of an algorithm depends heavily on its input data.
The following are some ways to clean data:
Better Data Quality
A common mistake among developers is chasing a perfect algorithm while ignoring one of the major factors in its success: data quality. Data Cleaning matters more than it sounds, because even the most efficient or sophisticated algorithm cannot perform well without good data. Poor-quality data may also lead to biased algorithms whose flaws are hard for a business to detect.
Outliers can cause problems for many models, such as Linear Regression, by reducing their robustness. But removing outliers unnecessarily, just because their values are large, can discard information the model needs. Only uninformative outliers should be removed; others may carry important information. So developers should have a legitimate reason before removing an outlier.
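One way to act on this advice is to flag suspicious values for review instead of deleting them automatically. The sketch below uses the interquartile range (IQR) rule, which is a common convention and not something the article prescribes; the function name, the sample prices, and the multiplier of 1.5 are all assumptions chosen for illustration.

```python
import statistics

def flag_outliers_iqr(values, k=1.5):
    """Return a list of booleans, True where a value lies outside the IQR fences.

    k=1.5 is a conventional default, not a rule from the article.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]

# Hypothetical house prices (in thousands); the last value looks suspicious.
house_prices = [210, 225, 230, 240, 250, 255, 260, 275, 2900]

flags = flag_outliers_iqr(house_prices)
suspects = [v for v, is_out in zip(house_prices, flags) if is_out]

# Flagged values are only candidates: inspect them and decide whether each one
# is a data-entry error (remove or fix) or a genuine observation (keep).
print(suspects)
```

Flagging first keeps the decision with a person who knows the data, which matches the point that an outlier should only be removed for a legitimate reason.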
Removing duplicate observations is a major part of Data Cleaning for Data Science. Duplicates often arise during data collection, for example when combining datasets from multiple places, receiving data from other parties, or scraping data. Some of these observations are also irrelevant, so finding and removing them may help improve the model’s performance.
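As a minimal sketch of removing duplicates, the snippet below keeps the first occurrence of each record and drops later repeats. The helper name, the record layout, and the choice of key fields are assumptions for illustration; in practice the key should be whatever uniquely identifies an observation in your dataset.

```python
def drop_duplicates(records, key_fields):
    """Keep the first record for each combination of key_fields."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Hypothetical records merged from two sources.
records = [
    {"id": 1, "email": "ann@example.com"},
    {"id": 2, "email": "bob@example.com"},
    {"id": 1, "email": "ann@example.com"},  # exact repeat from a second source
]

cleaned = drop_duplicates(records, key_fields=("id", "email"))
print(len(records), "->", len(cleaned))
```

If the data lives in a pandas DataFrame, `DataFrame.drop_duplicates()` does the same job with a `subset` argument for the key columns.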
Checking for Pad-Strings
Strings in data can be padded with whitespace or other characters. You need to find these strings and check that all the data in the column follows the same padding. For example, numerical codes may have zeros added in front, so you should check that all the codes have the same format and the same number of digits.
555 ==> 000555 (6 digits)
7777 ==> 007777 (6 digits)
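The padding example above can be sketched in a few lines of Python using the built-in `str.zfill`. The helper name `pad_code` and the width of 6 are assumptions taken from the example; codes longer than the target width are rejected so that inconsistent data is noticed and not silently kept.

```python
def pad_code(code, width=6):
    """Left-pad a numeric code with zeros so every code has `width` digits."""
    text = str(code).strip()
    if len(text) > width:
        raise ValueError(f"code {text!r} is longer than {width} characters")
    return text.zfill(width)

codes = ["555", "7777", "000123"]
padded = [pad_code(c) for c in codes]
print(padded)  # ['000555', '007777', '000123']

# Check that the whole column now follows the same format.
assert all(len(c) == 6 and c.isdigit() for c in padded)
```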
Removing White Spaces
When data is collected from various sources, strings often contain extra whitespace at the beginning or the end, which should be removed. Removing it keeps values that differ only in whitespace from being treated as different strings, and properly formatted strings can improve the model’s accuracy.
“ Hello World ” ==> “Hello World”
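In Python, leading and trailing whitespace can be removed with the built-in `str.strip`. The sample values below are made up for illustration; the sketch shows that three strings which look different to a program collapse into one value after trimming.

```python
# Hypothetical raw values: the same text with different surrounding whitespace.
raw = ["  Hello World ", "Hello World", "\tHello World\n"]

# str.strip() removes whitespace at both ends and leaves inner spaces alone.
trimmed = [s.strip() for s in raw]

print(len(set(raw)), "distinct values before trimming")
print(len(set(trimmed)), "distinct value after trimming")
```

For a pandas column of strings, `Series.str.strip()` applies the same operation to every value.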
Structural errors are those that occur when data is transferred from one source to another or when it is measured. They include typos such as spelling mistakes, and inconsistent capitalization, where the same string appears with different casing. Structural errors also include mislabeled classes, which can be fixed by merging them into the correct class.
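A minimal sketch of fixing structural errors in a categorical column: normalize the capitalization first, then map known typos and mislabeled classes onto one canonical class. The label values and the `TYPO_FIXES` table are hypothetical; in real data such a table is usually built by inspecting the distinct values in the column.

```python
from collections import Counter

# Hypothetical mapping from known typos to the intended class.
TYPO_FIXES = {"compostie": "composite"}

def normalize_label(label):
    """Trim, lower-case, and merge known misspellings into one class."""
    key = label.strip().lower()
    return TYPO_FIXES.get(key, key)

labels = ["Composite", "composite", "COMPOSITE ", "compostie", "Asphalt", "asphalt"]

before = Counter(labels)
after = Counter(normalize_label(label) for label in labels)

print(len(before), "classes before cleaning")
print(dict(after))
```

Counting the classes before and after is a quick way to confirm that the merge did what was intended.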
Standardizing strings means converting them consistently to either lower case or upper case. Standardizing numbers means expressing all values in one specific unit. For example, length may be recorded in meters or feet; observations whose values use a different unit should be converted so that the whole column shares a single unit.
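The unit example can be sketched as follows, assuming each measurement is stored together with its unit. The function name, the accepted unit spellings, and the sample values are assumptions for illustration; the conversion factor (1 foot = 0.3048 meters) is exact by definition.

```python
FEET_TO_METERS = 0.3048  # exact by international definition

def to_meters(value, unit):
    """Convert a length to meters; the unit string is standardized first."""
    unit = unit.strip().lower()
    if unit in ("m", "meter", "meters"):
        return float(value)
    if unit in ("ft", "foot", "feet"):
        return value * FEET_TO_METERS
    raise ValueError(f"unknown unit: {unit!r}")

# Hypothetical column where observations mix meters and feet.
measurements = [(10, "m"), (10, "ft"), (3.5, "Feet")]
in_meters = [round(to_meters(v, u), 3) for v, u in measurements]
print(in_meters)  # [10.0, 3.048, 1.067]
```

Raising an error on an unknown unit is a deliberate choice here: a silently unconverted value would be much harder to spot later.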
Most Machine Learning algorithms do not support missing data, so handling it is one of the most crucial parts of Data Cleaning.
Missing data can be handled in two ways:
1) Dropping missing values
One can drop the irrelevant observations and those with missing values. But dropping is not an efficient approach: it may discard important information along with the missing data.
2) Imputing missing values
One can impute missing values by filling them with the most frequent value, the mean, or the median computed from the other observations. But this is also not an optimal solution for Machine Learning, because the imputed value may change the meaning of some information.
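The two options above can be compared on a small example. This is a sketch with made-up ages in which `None` stands for a missing value; it uses the median for imputation, though the mean or the most frequent value would follow the same pattern.

```python
import statistics

# Hypothetical column; None marks a missing value.
ages = [25, None, 31, 40, None, 29]

# 1) Dropping: simple, but two of six observations are lost.
observed = [a for a in ages if a is not None]

# 2) Imputing: fill each gap with the median of the observed values.
median_age = statistics.median(observed)
imputed = [median_age if a is None else a for a in ages]

print(len(ages), "->", len(observed), "rows after dropping")
print(imputed)  # [25, 30.0, 31, 40, 30.0, 29]
```

After imputation the filled rows can no longer be told apart from real measurements, which is the loss of meaning this section warns about.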
A better solution for missing data is:
For categorical data: add a ‘Missing’ label for missing categorical feature values.
For numerical data: flag the missing value and fill it with 0, so that the meaning of the other data related to it is not lost.
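A minimal sketch of this approach, assuming each row is a dictionary and `None` marks a missing value. The column names `city` and `income` and the indicator column `income_missing` are hypothetical; the idea is that the model sees both the filled value and the fact that it was missing.

```python
# Hypothetical rows with one categorical and one numerical feature.
rows = [
    {"city": "Pune", "income": 52000},
    {"city": None, "income": 61000},
    {"city": "Delhi", "income": None},
]

def fill_with_flags(row):
    """Label missing categories and flag missing numbers before filling with 0."""
    out = dict(row)
    out["city"] = row["city"] if row["city"] is not None else "Missing"
    out["income_missing"] = int(row["income"] is None)  # 1 if it was missing
    out["income"] = row["income"] if row["income"] is not None else 0
    return out

cleaned = [fill_with_flags(r) for r in rows]
for r in cleaned:
    print(r)
```

No rows are dropped, and the indicator column lets a model treat the filled 0 differently from a genuine income of 0.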