Best practices for handling data in machine learning

Tudor Surdoiu
8 min read · Jun 14, 2021

In this article I tackle the most common data-related problems a machine learning practitioner is likely to encounter and present several ways to handle each of them.

Content list:

  • Outliers
  • Missing values
  • Data leakage
  • Data augmentation
  • Data partitioning
  • Data imbalance
  • Data sampling

Let's begin!

Outliers

Outliers are data points that, according to some distance metric, are dissimilar from the rest of the data. In a high-dimensional space outliers can be hard to detect, so a dimensionality reduction technique can be applied first to make the problem more manageable.
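To make the distance-based idea concrete, here is a minimal sketch that flags points lying unusually far from the center of the data. The choice of Euclidean distance, the per-feature median as the center, and the median absolute deviation (MAD) threshold are all assumptions made for illustration, and the function name `distance_outliers` is hypothetical. The dimensionality reduction step is not shown.

```python
import math
import statistics


def distance_outliers(points, threshold=3.0):
    """Return one boolean per point: True if the point looks like an outlier.

    Assumption: a point is an outlier when its Euclidean distance to the
    per-feature median is more than `threshold` MADs above the median distance.
    """
    n_features = len(points[0])
    # Per-feature median is a center that a few extreme points cannot move much.
    center = [statistics.median(p[d] for p in points) for d in range(n_features)]
    dists = [math.dist(p, center) for p in points]

    med = statistics.median(dists)
    mad = statistics.median(abs(d - med) for d in dists)
    if mad == 0:
        # All distances are (almost) identical: nothing stands out.
        return [False] * len(points)
    return [(d - med) / mad > threshold for d in dists]


points = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.1), (1.1, 2.0), (1.0, 2.2), (8.0, 9.0)]
flags = distance_outliers(points)
```

With this toy data only the last point, `(8.0, 9.0)`, is flagged. The threshold is a tuning knob, and a different distance metric may suit other kinds of data better.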

To handle outliers we have three options: use an algorithm that is robust to outliers, remove them from the dataset, or replace them.
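The last two options, removing and replacing, can be sketched for a single numeric feature. This example assumes the common 1.5 × IQR rule for deciding what counts as an outlier and uses the median of the remaining values as the replacement. Both are illustrative choices rather than the only reasonable ones, and the function names are hypothetical.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Lower and upper fences based on the interquartile range (IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Option: drop every value that falls outside the fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]


def replace_outliers(values, k=1.5):
    """Option: keep the row but swap the outlier for the median of the inliers."""
    low, high = iqr_bounds(values, k)
    fill = statistics.median(remove_outliers(values, k))
    return [v if low <= v <= high else fill for v in values]


values = [10, 12, 11, 13, 12, 11, 95]
removed = remove_outliers(values)
replaced = replace_outliers(values)
```

Removing shrinks the dataset, which matters when data is scarce. Replacing keeps every row but injects a synthetic value, so it is worth checking that the replacement does not distort the feature's distribution.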

SVMs and neural networks can be less sensitive to outliers. The former has a hyperparameter that controls the sensitivity to misclassifications by using a soft decision boundary, and the latter can be complex enough to learn to differentiate outliers from regular points.
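For reference, the standard soft-margin SVM formulation shows where that hyperparameter enters. The article does not name it; it is conventionally written as C, and the notation below (weights w, bias b, slack variables ξ_i, labels y_i in {-1, +1}) follows the usual textbook form rather than anything stated in the text.

```latex
\min_{w,\, b,\, \xi} \quad \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\qquad \text{subject to} \qquad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n
```

Each slack variable ξ_i measures how far point i violates the margin. A small C makes violations cheap, so a handful of outliers has little influence on the boundary. A large C penalizes every violation heavily and pushes the boundary to accommodate them.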
