Encoding categorical variables

Preprocessing for Machine Learning in Python

James Chapman

Curriculum Manager, DataCamp

Categorical variables

   user subscribed fav_color
0     1          y      blue
1     2          n     green
2     3          n    orange
3     4          y     green

Encoding binary variables - pandas

print(users["subscribed"])
0    y
1    n
2    n
3    y
Name: subscribed, dtype: object
users["sub_enc"] = users["subscribed"].apply(lambda val: 1 if val == "y" else 0)
print(users[["subscribed", "sub_enc"]])
  subscribed  sub_enc
0          y        1
1          n        0
2          n        0
3          y        1
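As a minimal, self-contained sketch of the same binary encoding: the `users` DataFrame below is rebuilt by hand from the table shown earlier, and `Series.map` with an explicit dictionary is used as an alternative to `apply` with a lambda. This is one possible variant, not the course's own code.

```python
import pandas as pd

# Rebuild the small example table from the slides (typed in by hand).
users = pd.DataFrame({
    "user": [1, 2, 3, 4],
    "subscribed": ["y", "n", "n", "y"],
    "fav_color": ["blue", "green", "orange", "green"],
})

# Map each label to its code explicitly. A value that is missing from
# the dictionary becomes NaN instead of silently turning into 0.
users["sub_enc"] = users["subscribed"].map({"y": 1, "n": 0})

print(users[["subscribed", "sub_enc"]])
```

The lambda version treats every value other than "y" as 0, so typos or missing values are encoded as "not subscribed". The dictionary version makes such rows visible as NaN, which may or may not be what you want.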

Encoding binary variables - scikit-learn

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
users["sub_enc_le"] = le.fit_transform(users["subscribed"])
print(users[["subscribed", "sub_enc_le"]])
  subscribed  sub_enc_le
0          y           1
1          n           0
2          n           0
3          y           1
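To see which label received which code, the fitted encoder can be inspected. The sketch below assumes scikit-learn is installed and rebuilds the `subscribed` column by hand; the `decoded` variable is introduced here only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hand-built copy of the "subscribed" column from the slides.
users = pd.DataFrame({"subscribed": ["y", "n", "n", "y"]})

le = LabelEncoder()
users["sub_enc_le"] = le.fit_transform(users["subscribed"])

# classes_ lists the learned labels in sorted order;
# each label's position in this array is its integer code.
print(le.classes_)

# inverse_transform maps the integer codes back to the original labels.
decoded = le.inverse_transform(users["sub_enc_le"])
print(decoded)
```

Because the labels are sorted, "n" gets code 0 and "y" gets code 1 here, which matches the pandas result above. With other label names the assignment could come out reversed, so checking `classes_` is worthwhile.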

One-hot encoding

fav_color
blue
green
orange
green

Values: [blue, green, orange]

  • blue: [1, 0, 0]
  • green: [0, 1, 0]
  • orange: [0, 0, 1]
fav_color_enc
[1, 0, 0]
[0, 1, 0]
[0, 0, 1]
[0, 1, 0]
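The mapping above can be sketched in plain Python to show what one-hot encoding does. The helper `one_hot` is a hypothetical name used only for this illustration, and sorting the distinct values is an assumption chosen to reproduce the order blue, green, orange.

```python
# The fav_color column from the slides, as a plain list.
fav_color = ["blue", "green", "orange", "green"]

# Distinct values in a fixed (sorted) order: one vector position per value.
values = sorted(set(fav_color))


def one_hot(value, values):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    return [1 if value == v else 0 for v in values]


fav_color_enc = [one_hot(color, values) for color in fav_color]

for color, vector in zip(fav_color, fav_color_enc):
    print(color, vector)
```

Each row gets exactly one 1, so the vector length equals the number of distinct values in the column.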
print(users["fav_color"])
0      blue
1     green
2    orange
3     green
Name: fav_color, dtype: object
print(pd.get_dummies(users["fav_color"]))
   blue  green  orange
0     1      0       0
1     0      1       0
2     0      0       1
3     0      1       0
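A self-contained sketch of the same call follows. Recent pandas versions return boolean columns from `get_dummies` by default, so `dtype=int` is passed here to reproduce the 0/1 output shown above. The `users` frame is rebuilt by hand, and `users_enc` is a name introduced only for this example.

```python
import pandas as pd

# Hand-built copy of the fav_color column from the slides.
users = pd.DataFrame({"fav_color": ["blue", "green", "orange", "green"]})

# dtype=int yields 0/1 columns; without it, newer pandas versions
# produce True/False columns instead.
dummies = pd.get_dummies(users["fav_color"], dtype=int)
print(dummies)

# Attach the encoded columns to the original frame.
users_enc = pd.concat([users, dummies], axis=1)
print(users_enc)
```

The new columns are named after the category values and appear in sorted order, one column per distinct value.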

Let's practice!
