Best practices for handling data in machine learning

8 min read · Jun 14, 2021

In this article I am going to tackle the most common data-related problems a machine learning practitioner could encounter and present several ways to handle them.

Content list:

Outliers
Missing values
Data leakage
Data augmentation
Data partitioning
Data imbalance
Data sampling

Let's begin!

Outliers

Outliers are data points that, based on some distance metric, are considered dissimilar from the rest of the data. In a high-dimensional space some outliers can become rather hard to detect, and a dimensionality reduction technique can be applied to make the problem more manageable.
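To make the distance-based definition concrete, here is a minimal sketch that flags points lying far from the center of the data. The choice of Euclidean distance, the coordinate-wise median as the center, the factor of 5, and the function name find_outliers are all my assumptions for illustration, not a standard recipe.

```python
import math
import statistics

def find_outliers(points, factor=5.0):
    """Flag points whose distance to the center is unusually large.

    Assumptions (illustrative only): Euclidean distance, the
    coordinate-wise median as the center, and a cutoff of
    `factor` times the median distance.
    """
    # Coordinate-wise median is less affected by extreme points than the mean.
    center = tuple(statistics.median(axis) for axis in zip(*points))
    distances = [math.dist(p, center) for p in points]
    threshold = factor * statistics.median(distances)
    return [p for p, d in zip(points, distances) if d > threshold]

points = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.1), (1.1, 2.0), (1.0, 1.8), (8.0, 9.0)]
outliers = find_outliers(points)
```

On this toy data only the point (8.0, 9.0) is flagged. With many dimensions, distances tend to become less informative, which is why reducing the dimensionality first can help.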

To handle outliers we have three solutions: use an algorithm that is robust to outliers, simply remove them from the dataset, or replace them with a more plausible value.
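The removal and replacement options can be sketched as follows for a single numeric feature. I am assuming the common interquartile range (IQR) rule with a factor of 1.5 to decide what counts as an outlier, and clipping to the IQR bounds as the replacement strategy; the article does not prescribe either, and the function names are hypothetical.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (low, high) bounds; values outside them count as outliers.

    Assumption: the IQR rule with factor k, which is one common
    convention among several.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(values, k=1.5):
    """Option 'remove': drop every value outside the bounds."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]

def clip_outliers(values, k=1.5):
    """Option 'replace': pull every outlier back to the nearest bound."""
    low, high = iqr_bounds(values, k)
    return [min(max(v, low), high) for v in values]

readings = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0, 10.0, 12.0, 13.0]
cleaned = remove_outliers(readings)   # 95.0 is dropped
clipped = clip_outliers(readings)     # 95.0 is replaced by the upper bound
```

Removing shrinks the dataset, while clipping keeps every row at the cost of altering the value, so which one fits depends on how much data you can afford to lose.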

SVMs and neural networks are less sensitive to outliers. The former has a hyperparameter that controls the sensitivity to misclassifications by using a soft decision boundary, and the latter can be complex enough to learn to tell outliers apart from the rest of the data.
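To illustrate how that SVM hyperparameter works, here is a small sketch of the soft-margin objective for a linear SVM. I am assuming the hyperparameter meant here is the penalty usually called C, and the weights, bias, and toy data are made up for illustration. The point is only to show how C scales the cost of a single mislabeled point.

```python
def hinge_loss(w, b, x, y):
    """Slack for one sample: zero if it is on the right side of the margin."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)

def soft_margin_objective(w, b, X, Y, C):
    """0.5 * ||w||^2 + C * sum of hinge losses (linear soft-margin SVM)."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    total_slack = sum(hinge_loss(w, b, x, y) for x, y in zip(X, Y))
    return regularizer + C * total_slack

# Toy 1-D data: the last sample carries a positive label deep inside
# the negative region, so it plays the role of the outlier.
X = [(-2.0,), (-1.5,), (1.5,), (2.0,), (-1.8,)]
Y = [-1, -1, 1, 1, 1]
w, b = (1.0,), 0.0  # hypothetical fixed boundary at x = 0

tolerant = soft_margin_objective(w, b, X, Y, C=0.1)   # outlier costs little
strict = soft_margin_objective(w, b, X, Y, C=10.0)    # outlier dominates
```

With a small C the outlier adds very little to the objective, so the optimizer has no reason to bend the boundary around it. With a large C the same point dominates the objective and pulls the boundary toward it.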

Written by Tudor Surdoiu
