Missing data

Winning a Kaggle Competition in Python

Yauhen Babakhin

Kaggle Grandmaster

Missing data

ID   Categorical feature   Numerical feature   Binary target
1    A                     5.1                 1
2    B                     7.2                 0
3    C                     3.4                 0
4    A                     NaN                 1
5    NaN                   2.6                 0
6    A                     5.3                 0

Impute missing data

Numerical data

  • Mean/median imputation

ID   Categorical feature   Numerical feature   Binary target
1    A                     5.1                 1
2    B                     7.2                 0
3    C                     3.4                 0
4    A                     NaN                 1
5    NaN                   2.6                 0
6    A                     5.3                 0

Impute missing data

Numerical data

  • Mean/median imputation
  • Constant value imputation

ID   Categorical feature   Numerical feature   Binary target
1    A                     5.1                 1
2    B                     7.2                 0
3    C                     3.4                 0
4    A                     4.72                1
5    NaN                   2.6                 0
6    A                     5.3                 0

Impute missing data

Numerical data

  • Mean/median imputation
  • Constant value imputation

ID   Categorical feature   Numerical feature   Binary target
1    A                     5.1                 1
2    B                     7.2                 0
3    C                     3.4                 0
4    A                     -999                1
5    NaN                   2.6                 0
6    A                     5.3                 0

Impute missing data

Numerical data

  • Mean/median imputation
  • Constant value imputation

Categorical data

  • Most frequent category imputation

ID   Categorical feature   Numerical feature   Binary target
1    A                     5.1                 1
2    B                     7.2                 0
3    C                     3.4                 0
4    A                     -999                1
5    NaN                   2.6                 0
6    A                     5.3                 0

Impute missing data

Numerical data

  • Mean/median imputation
  • Constant value imputation

Categorical data

  • Most frequent category imputation
  • New category imputation

ID   Categorical feature   Numerical feature   Binary target
1    A                     5.1                 1
2    B                     7.2                 0
3    C                     3.4                 0
4    A                     -999                1
5    A                     2.6                 0
6    A                     5.3                 0

Impute missing data

Numerical data

  • Mean/median imputation
  • Constant value imputation

Categorical data

  • Most frequent category imputation
  • New category imputation

ID   Categorical feature   Numerical feature   Binary target
1    A                     5.1                 1
2    B                     7.2                 0
3    C                     3.4                 0
4    A                     -999                1
5    MISS                  2.6                 0
6    A                     5.3                 0
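
The four strategies listed above can be sketched with plain pandas. This is only an illustration that uses `fillna` rather than scikit-learn. The column names `cat` and `num` are borrowed from the output shown later in the deck, and the result variable names are made up for this example.

```python
import numpy as np
import pandas as pd

# Example table from the slides (column names are an assumption)
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "cat": ["A", "B", "C", "A", np.nan, "A"],
    "num": [5.1, 7.2, 3.4, np.nan, 2.6, 5.3],
    "target": [1, 0, 0, 1, 0, 0],
})

# Numerical data: fill with the mean, the median, or a constant
num_mean = df["num"].fillna(df["num"].mean())
num_median = df["num"].fillna(df["num"].median())
num_constant = df["num"].fillna(-999)

# Categorical data: fill with the most frequent category or a new category
cat_frequent = df["cat"].fillna(df["cat"].mode()[0])
cat_new = df["cat"].fillna("MISS")

print(pd.DataFrame({
    "num_mean": num_mean,
    "num_constant": num_constant,
    "cat_frequent": cat_frequent,
    "cat_new": cat_new,
}))
```

Row 4 of the numerical column and row 5 of the categorical column are the only values that change, matching the tables on the slides.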

Find missing data

df.isnull().head(1)
         ID       cat       num    target
0     False     False     False     False
df.isnull().sum()
ID        0
cat       1
num       1
target    0
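
To try these calls on the example table, the DataFrame can be rebuilt by hand. The sketch below assumes the column names from the output above. The `rows_with_missing` selection is an extra step, not shown on the slides, for locating the affected rows.

```python
import numpy as np
import pandas as pd

# Example table from the slides; column names follow the isnull() output
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "cat": ["A", "B", "C", "A", np.nan, "A"],
    "num": [5.1, 7.2, 3.4, np.nan, 2.6, 5.3],
    "target": [1, 0, 0, 1, 0, 0],
})

# Number of missing values per column
missing_counts = df.isnull().sum()

# Rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_counts)
print(rows_with_missing)
```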

Numerical missing data

# Import SimpleImputer
from sklearn.impute import SimpleImputer

# Different types of imputers
mean_imputer = SimpleImputer(strategy='mean')
constant_imputer = SimpleImputer(strategy='constant', fill_value=-999)

# Imputation
df[['num']] = mean_imputer.fit_transform(df[['num']])
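
As a quick check, both imputers can be run on the numerical column of the example table. This is a minimal sketch. The DataFrame is rebuilt here by hand, and the `num_mean` and `num_constant` column names are hypothetical, added only so the two results can be compared side by side.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Numerical column of the example table from the slides
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "num": [5.1, 7.2, 3.4, np.nan, 2.6, 5.3],
})

mean_imputer = SimpleImputer(strategy="mean")
constant_imputer = SimpleImputer(strategy="constant", fill_value=-999)

# fit_transform returns a 2-D array by default, so flatten it to one column
mean_filled = np.asarray(mean_imputer.fit_transform(df[["num"]])).ravel()
constant_filled = np.asarray(constant_imputer.fit_transform(df[["num"]])).ravel()

# Keep the original column and add the two imputed versions next to it
result = df.assign(num_mean=mean_filled, num_constant=constant_filled)
print(result)
```

The mean of the five observed values is 4.72, so the row with ID 4 gets 4.72 under mean imputation and -999 under constant imputation.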

Categorical missing data

# Import SimpleImputer
from sklearn.impute import SimpleImputer

# Different types of imputers
frequent_imputer = SimpleImputer(strategy='most_frequent')
constant_imputer = SimpleImputer(strategy='constant', fill_value='MISS')

# Imputation
df[['cat']] = constant_imputer.fit_transform(df[['cat']])
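
The same comparison can be sketched for the categorical column. The example below assumes the column is stored with `object` dtype, and the `cat_frequent` and `cat_new` column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Categorical column of the example table, stored as object dtype (assumption)
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "cat": pd.Series(["A", "B", "C", "A", np.nan, "A"], dtype=object),
})

frequent_imputer = SimpleImputer(strategy="most_frequent")
constant_imputer = SimpleImputer(strategy="constant", fill_value="MISS")

# fit_transform returns a 2-D array by default, so flatten it to one column
frequent_filled = np.asarray(frequent_imputer.fit_transform(df[["cat"]])).ravel()
constant_filled = np.asarray(constant_imputer.fit_transform(df[["cat"]])).ravel()

# Keep the original column and add the two imputed versions next to it
result = df.assign(cat_frequent=frequent_filled, cat_new=constant_filled)
print(result)
```

Category A appears three times among the observed values, so the row with ID 5 becomes A under most frequent imputation and MISS under new category imputation.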

Let's practice!
