Target encoding

Winning a Kaggle Competition in Python

Yauhen Babakhin

Kaggle Grandmaster

High cardinality categorical features

 

  • A label encoder assigns a distinct number to each category
  • A one-hot encoder creates a new feature for each category value, so a high-cardinality feature adds many columns
  • Target encoding to the rescue!
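
To make the cardinality problem concrete, here is a minimal sketch with pandas. The `city` column and its 1000 values are hypothetical, and pandas category codes stand in for a label encoder.

```python
import pandas as pd

# Hypothetical feature with 1000 distinct category values
df = pd.DataFrame({"city": [f"city_{i}" for i in range(1000)]})

# Label-style encoding: one column holding a distinct integer per category
label_encoded = df["city"].astype("category").cat.codes

# One-hot encoding: one new column per category value
one_hot = pd.get_dummies(df["city"])

print(label_encoded.nunique())  # number of distinct integer codes
print(one_hot.shape)            # (rows, new columns)
```

With 1000 categories the one-hot frame has 1000 columns, while the label-encoded column carries 1000 arbitrary integers.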

Mean target encoding

Train
ID  Categorical  Target
1   A            1
2   B            0
3   B            0
4   A            1
5   B            0
6   A            0
7   B            1

Test
ID  Categorical  Target
10  A            ?
11  A            ?
12  B            ?
13  A            ?

Mean target encoding

 

  1. Calculate the target mean of each category on the train data and apply it to the test data
  2. Split the train data into K folds. Calculate the mean on (K-1) folds and apply it to the K-th fold
  3. Add the mean target encoded feature to the model

Calculate mean on the train

Train
ID  Categorical  Target
1   A            1
2   B            0
3   B            0
4   A            1
5   B            0
6   A            0
7   B            1

Mean for A: (1 + 1 + 0) / 3 = 2/3, shown as 0.66
Mean for B: (0 + 0 + 0 + 1) / 4 = 0.25

Test encoding

Test
ID  Categorical  Target  Mean encoded
10  A            ?       0.66
11  A            ?       0.66
12  B            ?       0.25
13  A            ?       0.66
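
The test encoding above can be sketched in pandas as follows. The DataFrame and variable names are illustrative choices; only the data comes from the tables.

```python
import pandas as pd

train = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7],
    "Categorical": ["A", "B", "B", "A", "B", "A", "B"],
    "Target": [1, 0, 0, 1, 0, 0, 1],
})
test = pd.DataFrame({
    "ID": [10, 11, 12, 13],
    "Categorical": ["A", "A", "B", "A"],
})

# Mean of the target per category, computed on the whole train set
category_means = train.groupby("Categorical")["Target"].mean()

# Map each test row's category to the corresponding train mean
test["Mean encoded"] = test["Categorical"].map(category_means)
print(test)
```

Category A maps to 2/3 and category B to 0.25, matching the table up to the two-decimal display.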

Train encoding using out-of-fold

Train
ID  Categorical  Target  Fold
1   A            1       1
2   B            0       1
3   B            0       1
4   A            1       1
5   B            0       2
6   A            0       2
7   B            1       2

Train encoding using out-of-fold

Train
ID  Categorical  Target  Fold  Mean encoded
1   A            1       1     0
2   B            0       1     0.5
3   B            0       1     0.5
4   A            1       1     0
5   B            0       2     0
6   A            0       2     1
7   B            1       2     0
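
A minimal sketch of the out-of-fold encoding, using the fold assignment from the table rather than a random split. In practice the folds would usually come from a splitter such as scikit-learn's `KFold`; that choice is an assumption and is not shown on the slides.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7],
    "Categorical": ["A", "B", "B", "A", "B", "A", "B"],
    "Target": [1, 0, 0, 1, 0, 0, 1],
    "Fold": [1, 1, 1, 1, 2, 2, 2],
})

train["Mean encoded"] = np.nan
for fold in sorted(train["Fold"].unique()):
    in_fold = train["Fold"] == fold
    # Category means from every row that is NOT in the current fold
    out_of_fold_means = train[~in_fold].groupby("Categorical")["Target"].mean()
    # Apply those means to the rows of the current fold only
    train.loc[in_fold, "Mean encoded"] = train.loc[in_fold, "Categorical"].map(out_of_fold_means)

print(train)
```

Each row is encoded with statistics that exclude its own target value, which is the point of the out-of-fold scheme.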

Practical guides

Smoothing

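$$mean\_enc_i = \frac{target\_sum_i}{n_i}$$

$$smoothed\_mean\_enc_i = \frac{target\_sum_i + \alpha \cdot global\_mean}{n_i + \alpha}$$

$$\alpha \in [5, 10]$$

Here target_sum_i is the sum of the target over the n_i train rows of category i, global_mean is the target mean over the whole train set, and α controls how strongly rare categories are pulled toward global_mean.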

New categories

  • Fill categories that appear only in the test data with the global_mean
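
Both guides can be combined in one helper. This is a sketch under stated assumptions: the function name `smoothed_target_encode` and its signature are hypothetical, and the data reuses the seven-row train table from earlier with an extra unseen category C.

```python
import pandas as pd

def smoothed_target_encode(train, test, column, target, alpha=5):
    """Smoothed mean target encoding of `column` for the test rows (hypothetical helper)."""
    global_mean = train[target].mean()
    # target_sum_i and n_i for every category i seen in train
    stats = train.groupby(column)[target].agg(["sum", "count"])
    smoothed = (stats["sum"] + alpha * global_mean) / (stats["count"] + alpha)
    # Categories never seen in train map to NaN, so fall back to the global mean
    return test[column].map(smoothed).fillna(global_mean)

train = pd.DataFrame({
    "Categorical": ["A", "B", "B", "A", "B", "A", "B"],
    "Target": [1, 0, 0, 1, 0, 0, 1],
})
test = pd.DataFrame({"Categorical": ["A", "B", "C"]})

test["Mean encoded"] = smoothed_target_encode(train, test, "Categorical", "Target", alpha=5)
print(test["Mean encoded"].round(3).tolist())  # [0.518, 0.349, 0.429]
```

With alpha=5 the raw means 0.667 (A) and 0.25 (B) are both pulled toward the global mean of about 0.429, and the unseen category C receives the global mean itself.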

Practical guides

 

Train
ID  Categorical  Target
1   A            1
2   B            0
3   B            0
4   A            0
5   B            1

Test
ID  Categorical  Target  Mean encoded
10  A            ?       0.43
11  B            ?       0.38
12  C            ?       0.40
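
As a check on the table above, the encoded values are consistent with the smoothing formula at α = 5. The value of α is not stated on the slide; it is inferred here because it reproduces all three numbers.

$$global\_mean = \frac{1 + 0 + 0 + 0 + 1}{5} = 0.40$$

$$smoothed\_mean\_enc_A = \frac{1 + 5 \cdot 0.40}{2 + 5} = \frac{3}{7} \approx 0.43$$

$$smoothed\_mean\_enc_B = \frac{1 + 5 \cdot 0.40}{3 + 5} = \frac{3}{8} \approx 0.38$$

Category C does not occur in the train data, so it receives the global_mean, 0.40.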

Let's practice!
