Categorical features

Winning a Kaggle Competition in Python

Yauhen Babakhin

Kaggle Grandmaster

Label encoding

ID   Categorical feature   Label-encoded
1    A                     0
2    B                     1
3    C                     2
4    A                     0
5    D                     3
6    A                     0

Label encoding

# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Create a LabelEncoder object
le = LabelEncoder()

# Encode a categorical feature
df['cat_encoded'] = le.fit_transform(df['cat'])
     ID   cat  cat_encoded
0     1   A    0
1     2   B    1
2     3   C    2
3     4   A    0
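
To run the snippet above end to end, here is a minimal sketch that rebuilds the six-row example as a DataFrame. How `df` was created is not shown in the slides, so the construction below is an assumption. The sketch also reads `le.classes_`, which exposes the category-to-integer mapping the encoder learned.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical DataFrame mirroring the six-row example table
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                   'cat': ['A', 'B', 'C', 'A', 'D', 'A']})

# Fit the encoder and add the encoded column
le = LabelEncoder()
df['cat_encoded'] = le.fit_transform(df['cat'])

# classes_ holds the sorted categories; a category's position is its label
mapping = {label: code for code, label in enumerate(le.classes_)}

print(mapping)
print(df)
```

Because the categories are sorted before labels are assigned, the integers carry an ordering that the original categories may not have.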

One-Hot encoding

ID   Categorical feature   Cat == A   Cat == B   Cat == C   Cat == D
1    A                     1          0          0          0
2    B                     0          1          0          0
3    C                     0          0          1          0
4    A                     1          0          0          0
5    D                     0          0          0          1
6    A                     1          0          0          0

One-Hot encoding

import pandas as pd

# Create One-Hot encoded features
# (dtype=int keeps 0/1 columns; recent pandas versions default to booleans)
ohe = pd.get_dummies(df['cat'], prefix='ohe_cat', dtype=int)

# Drop the initial feature
df.drop('cat', axis=1, inplace=True)

# Concatenate OHE features to the dataframe
df = pd.concat([df, ohe], axis=1)
     ID ohe_cat_A ohe_cat_B ohe_cat_C ohe_cat_D
0     1         1         0         0         0
1     2         0         1         0         0
2     3         0         0         1         0
3     4         1         0         0         0
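
The one-hot steps can be checked on the full six-row example with a short, self-contained sketch. The DataFrame construction is an assumption (the slides do not show it), and `dtype=int` is passed so the indicator columns print as 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical DataFrame mirroring the six-row example table
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                   'cat': ['A', 'B', 'C', 'A', 'D', 'A']})

# One indicator column per distinct category
ohe = pd.get_dummies(df['cat'], prefix='ohe_cat', dtype=int)

# Replace the original column with its indicator columns
df = pd.concat([df.drop('cat', axis=1), ohe], axis=1)

# Sanity check: every row activates exactly one indicator column
row_sums = df.filter(like='ohe_cat_').sum(axis=1)

print(df)
```

The number of new columns equals the number of distinct categories, so this encoding grows wide quickly for high-cardinality features.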

Binary Features

# DataFrame with a binary feature
binary_feature
      binary_feat
0     Yes
1     No
le = LabelEncoder()
binary_feature['binary_encoded'] = le.fit_transform(binary_feature['binary_feat'])
  binary_feat binary_encoded
0     Yes     1
1     No      0
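
For a two-valued feature, a single 0/1 column already carries all the information. The sketch below, which assumes the two-row frame shown above, compares label encoding with one-hot encoding and confirms that the two one-hot columns are exact complements of each other.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame matching the two-row example
binary_feature = pd.DataFrame({'binary_feat': ['Yes', 'No']})

# Label encoding: 'No' -> 0, 'Yes' -> 1 (categories are sorted alphabetically)
le = LabelEncoder()
binary_feature['binary_encoded'] = le.fit_transform(binary_feature['binary_feat'])

# One-hot encoding of the same feature, for comparison
ohe = pd.get_dummies(binary_feature['binary_feat'], prefix='ohe', dtype=int)

# The 'No' column always equals 1 minus the 'Yes' column, so it adds nothing
redundant = bool((ohe['ohe_No'] == 1 - ohe['ohe_Yes']).all())

print(binary_feature)
print(redundant)
```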

Other encoding approaches

  • Backward Difference Coding
  • BaseN
  • Binary
  • CatBoost Encoder
  • Hashing
  • Helmert Coding
  • James-Stein Encoder
  • Leave One Out
  • M-estimate
  • One Hot
  • Ordinal
  • Polynomial Coding
  • Sum Coding
  • Target Encoder
  • Weight of Evidence
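
As one illustration from this list, the basic idea behind a target encoder can be sketched in plain pandas: replace each category with the mean of the target over the rows in that category. This is a simplified sketch, not any library's implementation; the `train` frame and its `target` values are made up for the example, and practical target encoders typically add smoothing and out-of-fold estimation.

```python
import pandas as pd

# Hypothetical training data; the target values are made up for illustration
train = pd.DataFrame({'cat': ['A', 'B', 'C', 'A', 'D', 'A'],
                      'target': [1, 0, 1, 0, 1, 1]})

# Mean of the target within each category
category_means = train.groupby('cat')['target'].mean()

# Replace each category with its mean target value
train['cat_target_enc'] = train['cat'].map(category_means)

print(category_means)
print(train)
```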

Let's practice!
