Winning a Kaggle Competition in Python
Yauhen Babakhin
Kaggle Grandmaster
| Train ID | Categorical | Target |
|---|---|---|
| 1 | A | 1 |
| 2 | B | 0 |
| 3 | B | 0 |
| 4 | A | 1 |
| 5 | B | 0 |
| 6 | A | 0 |
| 7 | B | 1 |
| Test ID | Categorical | Target |
|---|---|---|
| 10 | A | ? |
| 11 | A | ? |
| 12 | B | ? |
| 13 | A | ? |
| Test ID | Categorical | Target | Mean encoded |
|---|---|---|---|
| 10 | A | ? | 0.67 |
| 11 | A | ? | 0.67 |
| 12 | B | ? | 0.25 |
| 13 | A | ? | 0.67 |
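
The test encoding above can be reproduced with a minimal sketch. It assumes plain tuples for the rows and a hypothetical helper `category_means`; neither comes from the slides. It is kept dependency-free for illustration and is not a definitive implementation.

```python
from collections import defaultdict

# Rows copied from the tables above.
# Train rows: (id, category, target); test rows: (id, category)
train = [(1, "A", 1), (2, "B", 0), (3, "B", 0), (4, "A", 1),
         (5, "B", 0), (6, "A", 0), (7, "B", 1)]
test = [(10, "A"), (11, "A"), (12, "B"), (13, "A")]


def category_means(rows):
    """Hypothetical helper: return {category: mean target} for the rows."""
    sums, counts = defaultdict(float), defaultdict(int)
    for _, category, target in rows:
        sums[category] += target
        counts[category] += 1
    return {c: sums[c] / counts[c] for c in sums}


# The statistics come from the train data only; test rows just look them up.
means = category_means(train)
test_encoded = {row_id: means[category] for row_id, category in test}

for row_id, value in test_encoded.items():
    print(row_id, round(value, 2))
```

Category A maps to 2/3 (two positive targets out of three train rows) and category B maps to 1/4, so every test row with the same category receives the same value.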
| Train ID | Categorical | Target | Fold |
|---|---|---|---|
| 1 | A | 1 | 1 |
| 2 | B | 0 | 1 |
| 3 | B | 0 | 1 |
| 4 | A | 1 | 1 |
| 5 | B | 0 | 2 |
| 6 | A | 0 | 2 |
| 7 | B | 1 | 2 |
| Train ID | Categorical | Target | Fold | Mean encoded |
|---|---|---|---|---|
| 1 | A | 1 | 1 | |
| 2 | B | 0 | 1 | |
| 3 | B | 0 | 1 | |
| 4 | A | 1 | 1 | |
| 5 | B | 0 | 2 | |
| 6 | A | 0 | 2 | |
| 7 | B | 1 | 2 | |
| Train ID | Categorical | Target | Fold | Mean encoded |
|---|---|---|---|---|
| 1 | A | 1 | 1 | 0 |
| 2 | B | 0 | 1 | 0.5 |
| 3 | B | 0 | 1 | 0.5 |
| 4 | A | 1 | 1 | 0 |
| 5 | B | 0 | 2 | |
| 6 | A | 0 | 2 | |
| 7 | B | 1 | 2 | |
| Train ID | Categorical | Target | Fold | Mean encoded |
|---|---|---|---|---|
| 1 | A | 1 | 1 | 0 |
| 2 | B | 0 | 1 | 0.5 |
| 3 | B | 0 | 1 | 0.5 |
| 4 | A | 1 | 1 | 0 |
| 5 | B | 0 | 2 | 0 |
| 6 | A | 0 | 2 | 1 |
| 7 | B | 1 | 2 | 0 |
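
The out-of-fold encoding of the train data can be sketched as follows. The fold assignment is taken from the table as given; the tuple layout and the helper `category_means` are assumptions made for illustration. Each fold is encoded with means computed on the other fold only, so a row never sees its own target.

```python
from collections import defaultdict

# Train rows from the table above: (id, category, target, fold)
train = [(1, "A", 1, 1), (2, "B", 0, 1), (3, "B", 0, 1), (4, "A", 1, 1),
         (5, "B", 0, 2), (6, "A", 0, 2), (7, "B", 1, 2)]


def category_means(rows):
    """Hypothetical helper: return {category: mean target} for the rows."""
    sums, counts = defaultdict(float), defaultdict(int)
    for _, category, target, _ in rows:
        sums[category] += target
        counts[category] += 1
    return {c: sums[c] / counts[c] for c in sums}


folds = sorted({row[3] for row in train})
out_of_fold = {}
for fold in folds:
    # Compute the category means on every fold except the current one
    other_rows = [row for row in train if row[3] != fold]
    means = category_means(other_rows)
    # Encode only the rows of the current fold with those means
    for row_id, category, _, row_fold in train:
        if row_fold == fold:
            out_of_fold[row_id] = means[category]

print(out_of_fold)
```

Fold 1 receives A = 0 and B = 0.5 (computed on rows 5 to 7), and fold 2 receives A = 1 and B = 0 (computed on rows 1 to 4), matching the completed table. Note that this sketch raises a `KeyError` if a category is missing from the other folds; a fallback such as the global mean would be one way to handle that case.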
$$mean\_enc_i = \frac{target\_sum_i}{n_i}$$
$$smoothed\_mean\_enc_i = \frac{target\_sum_i + \alpha \cdot global\_mean}{n_i + \alpha}$$
$$\alpha \in [5, 10]$$
| Train ID | Categorical | Target |
|---|---|---|
| 1 | A | 1 |
| 2 | B | 0 |
| 3 | B | 0 |
| 4 | A | 0 |
| 5 | B | 1 |
| Test ID | Categorical | Target | Mean encoded |
|---|---|---|---|
| 10 | A | ? | 0.43 |
| 11 | B | ? | 0.38 |
| 12 | C | ? | 0.40 |
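
The smoothed values in this last table can be checked with a short sketch. The slides do not state which smoothing strength was used for the example; `ALPHA = 5` is an assumption, chosen because it reproduces the values shown. The helper `smoothed_means` and the tuple layout are also illustrative. A category that never appears in train (here C) falls back to the global mean.

```python
from collections import defaultdict

# Rows copied from the tables above.
# Train rows: (id, category, target); test rows: (id, category)
train = [(1, "A", 1), (2, "B", 0), (3, "B", 0), (4, "A", 0), (5, "B", 1)]
test = [(10, "A"), (11, "B"), (12, "C")]

ALPHA = 5  # assumed smoothing strength, lower end of the suggested range


def smoothed_means(rows, alpha):
    """Hypothetical helper: return ({category: smoothed mean}, global mean)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for _, category, target in rows:
        sums[category] += target
        counts[category] += 1
    global_mean = sum(sums.values()) / sum(counts.values())
    # (target_sum_i + alpha * global_mean) / (n_i + alpha) for each category
    encoding = {
        c: (sums[c] + alpha * global_mean) / (counts[c] + alpha)
        for c in sums
    }
    return encoding, global_mean


encoding, global_mean = smoothed_means(train, ALPHA)

# Unseen categories receive the global mean of the train target
test_encoded = {
    row_id: encoding.get(category, global_mean)
    for row_id, category in test
}

for row_id, value in test_encoded.items():
    print(row_id, round(value, 2))
```

With a global mean of 2/5 = 0.4, category A becomes (1 + 5 · 0.4) / (2 + 5) = 3/7 ≈ 0.43, category B becomes (1 + 5 · 0.4) / (3 + 5) = 3/8 ≈ 0.38, and the new category C receives the global mean 0.40.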