Best practices for handling data in machine learning
In this article I tackle the most common data-related problems a machine learning practitioner is likely to encounter and present several ways to handle them.
Content list:
- Outliers
- Missing values
- Data leakage
- Data augmentation
- Data partitioning
- Data imbalance
- Data sampling
Let's begin!
Outliers
Outliers are data points that, according to some distance metric, are dissimilar from the rest of the data. In a high-dimensional space outliers can be hard to detect, and applying a dimensionality reduction technique first can make the problem more manageable.
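As a minimal sketch of the distance-based idea, the snippet below flags points that lie unusually far from the centre of the data. The function name `distance_outliers`, the use of the coordinate-wise median as the centre, the median absolute deviation as the measure of spread, and the threshold of 3.5 are all assumptions chosen for illustration, not a method prescribed by this article. It does not perform any dimensionality reduction.

```python
import math
import statistics


def distance_outliers(points, threshold=3.5):
    """Return indices of points that lie unusually far from the data centre.

    Hypothetical helper: the centre, spread estimate and threshold are
    illustrative choices, not the only way to define an outlier.
    """
    dims = range(len(points[0]))
    # Coordinate-wise median: a centre that extreme points barely influence.
    center = [statistics.median(p[d] for p in points) for d in dims]
    # Euclidean distance of every point to that centre.
    dists = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(p, center)))
        for p in points
    ]
    med = statistics.median(dists)
    # Median absolute deviation: a robust estimate of how much distances vary.
    mad = statistics.median(abs(d - med) for d in dists)
    if mad == 0:
        # All distances are (almost) identical, so nothing stands out.
        return []
    return [i for i, d in enumerate(dists) if (d - med) / mad > threshold]


points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (1.1, 1.2), (0.8, 0.9), (8.0, 9.0)]
outlier_idx = distance_outliers(points)
```

Here only the last point sits far from the cluster around (1, 1), so it is the one that gets flagged. With many features, Euclidean distances tend to become less informative, which is where reducing dimensionality first can help.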
To handle outliers we have three options: use an algorithm that is robust to them, remove them from the dataset, or replace them.
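The second and third options can be sketched for a single numeric feature as follows. The helpers `iqr_bounds`, `remove_outliers` and `clip_outliers` are hypothetical names, and using Tukey's fences (1.5 times the interquartile range) to decide what counts as an outlier is an assumption made for this example. Replacing is shown here as clipping to the nearest fence, which is only one of several possible replacement strategies.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Tukey fences: [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Option 2: drop every value outside the fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]


def clip_outliers(values, k=1.5):
    """Option 3: replace values outside the fences with the nearest fence."""
    low, high = iqr_bounds(values, k)
    return [min(max(v, low), high) for v in values]


values = [10, 12, 11, 13, 12, 11, 95]
removed = remove_outliers(values)   # the dataset shrinks by one sample
clipped = clip_outliers(values)     # same length, extreme value pulled in
```

Removing keeps the remaining values untouched but loses a sample, while clipping preserves the dataset size at the cost of altering the extreme value.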
SVMs and neural networks are less sensitive to outliers. The former have a hyperparameter that controls the sensitivity to misclassifications by using a soft decision boundary (a soft margin), and the latter can be complex enough to learn to tell outliers apart from the rest of the data.
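For reference, the standard textbook form of the soft-margin SVM objective shows where this hyperparameter enters. The symbols below (weights w, bias b, slack variables ξ, labels y in {-1, +1}, penalty C) are the usual textbook notation and are assumed here, since the article itself does not define them.

```latex
\min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^\top x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n
```

A smaller C makes margin violations cheaper, so a few outlying points have less pull on the boundary. A larger C penalizes violations more heavily and makes the fit more sensitive to them.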