Handling missing data

Practicing Machine Learning Interview Questions in Python

Lisa Stuart

Data Scientist

Prerequisites

Course outline

  • Chapter 1: Pre-processing and Visualization
    • Missing data, Outliers, Normalization
  • Chapter 2: Supervised Learning
    • Feature selection, Regularization, Feature engineering
  • Chapter 3: Unsupervised Learning
    • Cluster algorithm selection, Feature extraction, Dimension reduction
  • Chapter 4: Model Selection and Evaluation
    • Model generalization and evaluation, Model selection

Machine learning (ML) pipeline

Pipeline Overview

Our ML pipeline

Pipeline

Missing data

  • Impact of different techniques
  • Finding missing values
  • Strategies to handle

Techniques

  1. Omission
    • Removal of rows --> .dropna(axis=0)
    • Removal of columns --> .dropna(axis=1)
  2. Imputation
    • Fill with zero --> SimpleImputer(strategy='constant', fill_value=0)
    • Impute mean --> SimpleImputer(strategy='mean')
    • Impute median --> SimpleImputer(strategy='median')
    • Impute mode --> SimpleImputer(strategy='most_frequent')
    • Iterative imputation --> IterativeImputer()
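
The techniques listed above can be sketched as follows. This is a minimal illustration on a made-up DataFrame (the column names and values are hypothetical), not the course's own exercise code. Note that scikit-learn still marks IterativeImputer as experimental, so it needs the explicit enabling import shown here.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to expose IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer

# Hypothetical toy data: two numeric features, each with one gap.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, 58.0],
    "income": [30.0, 42.0, np.nan, 38.0, 90.0],
})

# 1. Omission
rows_dropped = df.dropna(axis=0)  # keep only rows without missing values
cols_dropped = df.dropna(axis=1)  # keep only columns without missing values

# 2. Imputation with one statistic per column
imputers = {
    "zero": SimpleImputer(strategy="constant", fill_value=0),
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "mode": SimpleImputer(strategy="most_frequent"),
}
filled = {
    name: pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    for name, imputer in imputers.items()
}

# Iterative imputation estimates each feature from the other features.
iterative = pd.DataFrame(
    IterativeImputer(random_state=0).fit_transform(df),
    columns=df.columns,
)

print(rows_dropped.shape, cols_dropped.shape)
print(filled["mean"])
```

On this toy data every column has a gap, so dropping columns leaves nothing at all, which already hints at why omission is rarely the first choice.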

Why bother?

  • Reduce the probability of introducing bias
  • Most ML algorithms require complete data
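
To see the second point in practice, the sketch below fits a scikit-learn LinearRegression on a tiny made-up array containing one NaN. The data is hypothetical; the point is only that the estimator refuses to fit rather than silently skipping the gap.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: one feature value is missing.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as err:
    # scikit-learn validates its inputs and rejects NaN for this estimator.
    fit_failed = True
    print("fit failed with", type(err).__name__)
```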

Effects of imputation

  • Depend on:
    • Proportion of missing values
    • Original variance
    • Presence of outliers
    • Size and direction of skew
  • Omission --> Can remove too much
  • Zero --> Bias results downward
  • Mean --> Affected more by outliers
  • Median --> Better in case of outliers
  • Mode and iterative imputation --> Try them out
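
A small worked example may help make these effects concrete. The series below is hypothetical: four typical values, one large outlier, and two gaps. It compares how zero, mean, and median filling change the column's summary statistics.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed feature: one outlier (200) and two missing values.
s = pd.Series([10.0, 12.0, 11.0, 13.0, 200.0, np.nan, np.nan])

observed_mean = s.mean()      # pandas skips NaN by default
observed_median = s.median()

zero_filled = s.fillna(0)
mean_filled = s.fillna(observed_mean)
median_filled = s.fillna(observed_median)

summary = pd.DataFrame({
    "mean": [s.mean(), zero_filled.mean(), mean_filled.mean(), median_filled.mean()],
    "median": [s.median(), zero_filled.median(), mean_filled.median(), median_filled.median()],
}, index=["observed", "zero", "mean", "median"])
print(summary.round(2))
```

Zero filling drags the column mean below the observed mean. Mean filling inserts a value pulled up by the outlier, larger than four of the five observed points, while median filling inserts a value close to the bulk of the data.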

Function                                Returns
df.isna().sum()                         number of missing values per column
df['feature'].mean()                    feature mean
.shape                                  row, column dimensions
df.columns                              column names
.fillna(0)                              fills missing values with 0
.select_dtypes(include=[np.number])     numeric columns
.select_dtypes(include=['object'])      string columns
.fit_transform(numeric_cols)            fits and transforms
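
The functions in this table could be combined roughly as shown below. The DataFrame and its column names are invented for illustration, and the median strategy is just one possible choice for the imputer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical mixed-type data with gaps in every column.
df = pd.DataFrame({
    "feature": [1.0, np.nan, 3.0, 5.0],
    "amount": [10.0, 20.0, np.nan, np.nan],
    "city": pd.Series(["Oslo", None, "Lima", "Oslo"], dtype="object"),
})

print(df.shape)              # (rows, columns)
print(df.columns.tolist())   # column names
print(df.isna().sum())       # missing values per column
print(df["feature"].mean())  # mean of the observed values only

# Split by dtype so numeric imputation is applied to numeric columns only.
numeric_cols = df.select_dtypes(include=[np.number])
string_cols = df.select_dtypes(include=["object"])

# Option A: fill numeric gaps with 0.
zero_filled = numeric_cols.fillna(0)

# Option B: learn the per-column median and fill with it in one step.
imputed = SimpleImputer(strategy="median").fit_transform(numeric_cols)
numeric_imputed = pd.DataFrame(imputed, columns=numeric_cols.columns)
print(numeric_imputed)
```

fit_transform returns a NumPy array, so the column names are reattached by wrapping the result in a DataFrame.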

Effects of missing values

What are the effects of missing values in a Machine Learning (ML) setting? Select the answer that is true:

  • Missing values aren't a problem since most of sklearn's algorithms can handle them.
  • Removing observations or features with missing values is generally a good idea.
  • Missing data tends to introduce bias that leads to misleading results, so it cannot be ignored.
  • Filling missing values with zero will bias results upward.

Effects of missing values: answer

What are the effects of missing values in a Machine Learning (ML) setting? The correct answer is:

  • Missing data tends to introduce bias that leads to misleading results, so it cannot be ignored. (The best approach is to test several ways of filling the missing values and choose the one that changes the variance of the dataset the least.)
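
One way to apply the variance heuristic from this answer is sketched below. The data is simulated (a hypothetical normally distributed feature with 30 of 200 values removed), and the comparison metric, absolute change in sample variance, is an assumption about how "impacts the variance the least" could be measured.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Simulated feature: 200 draws, 30 of them set to missing at random.
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=200)
values[rng.choice(200, size=30, replace=False)] = np.nan
X = pd.DataFrame({"feature": values})

original_var = X["feature"].var()  # sample variance of the observed values

strategies = {
    "zero": SimpleImputer(strategy="constant", fill_value=0),
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
}

# Absolute change in variance caused by each strategy.
var_change = {}
for name, imputer in strategies.items():
    filled = imputer.fit_transform(X).ravel()
    var_change[name] = abs(np.var(filled, ddof=1) - original_var)

best = min(var_change, key=var_change.get)
print(var_change)
print("smallest variance change:", best)
```

On a feature centered far from zero, zero filling inflates the variance sharply, while mean and median filling shrink it only slightly.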

Effects of missing values: incorrect answers

What are the effects of missing values in a Machine Learning (ML) setting?

  • Missing values aren't a problem... (Most of sklearn's algorithms cannot handle missing values and will throw an error.)
  • Removing observations or features with missing values... (Unless your dataset is large and the proportion of missing values small, removing rows or columns with missing data usually results in shrinking your dataset too much to be useful in subsequent ML.)
  • Filling missing values with zero will bias results upward. (It's the opposite: filling with zero will bias results downward.)
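
The point about omission shrinking the dataset can be illustrated with simulated data. In the sketch below, roughly 10% of the cells in a hypothetical 1000 x 10 table are blanked out at random; the sizes and the missing rate are arbitrary choices made for this example.

```python
import numpy as np
import pandas as pd

# Simulated table: 1000 rows, 10 features, about 10% of cells missing.
rng = np.random.default_rng(42)
data = rng.normal(size=(1000, 10))
data[rng.random(data.shape) < 0.10] = np.nan
df = pd.DataFrame(data, columns=[f"f{i}" for i in range(10)])

rows_left = df.dropna(axis=0).shape[0]  # rows with no missing value
cols_left = df.dropna(axis=1).shape[1]  # columns with no missing value

print(f"rows kept: {rows_left} of {len(df)}")
print(f"columns kept: {cols_left} of {df.shape[1]}")
```

Because the gaps are scattered across columns, a row survives only if all ten of its cells are present, so a modest per-cell missing rate still removes most rows.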

Let's practice!
