Encoding categorical variables

Preprocessing for Machine Learning in Python

James Chapman

Curriculum Manager, DataCamp

Categorical variables

   user subscribed fav_color
0     1          y      blue
1     2          n     green
2     3          n    orange
3     4          y     green
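
To follow along, the table above can be rebuilt as a small DataFrame. This is a minimal sketch: the values are copied from the slide, and the name users is taken from the later slides, but how the course actually loads the data is not shown, so the construction here is an assumption.

```python
import pandas as pd

# Rebuild the example table (values copied from the slide above).
users = pd.DataFrame({
    "user": [1, 2, 3, 4],
    "subscribed": ["y", "n", "n", "y"],
    "fav_color": ["blue", "green", "orange", "green"],
})

# List the distinct values in each categorical column before encoding.
for col in ["subscribed", "fav_color"]:
    print(col, sorted(users[col].unique().tolist()))
```

subscribed has two distinct values (binary), while fav_color has three, which is why the two columns are encoded differently on the following slides.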

Encoding binary variables - pandas

print(users["subscribed"])

0    y
1    n
2    n
3    y
Name: subscribed, dtype: object

users["sub_enc"] = users["subscribed"].apply(lambda val: 1 if val == "y" else 0)

print(users[["subscribed", "sub_enc"]])

  subscribed  sub_enc
0          y        1
1          n        0
2          n        0
3          y        1
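
The apply call on this slide can also be written with Series.map and an explicit dictionary. This is an alternative sketch, not the course's method; the sample data is recreated from the slide.

```python
import pandas as pd

# Sample column recreated from the slide.
users = pd.DataFrame({"subscribed": ["y", "n", "n", "y"]})

# An explicit mapping: any value missing from the dictionary becomes NaN
# instead of silently falling into the "else 0" branch.
users["sub_enc"] = users["subscribed"].map({"y": 1, "n": 0})

print(users[["subscribed", "sub_enc"]])
```

The design difference is in how unexpected values are handled: the lambda encodes anything other than "y" as 0, whereas map leaves unmapped values as NaN so they can be spotted.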

Encoding binary variables - scikit-learn

from sklearn.preprocessing import LabelEncoder


le = LabelEncoder()
users["sub_enc_le"] = le.fit_transform(users["subscribed"])


print(users[["subscribed", "sub_enc_le"]])

  subscribed  sub_enc_le
0          y           1
1          n           0
2          n           0
3          y           1
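
To see which integer each label received, the fitted encoder can be inspected. A minimal sketch, assuming the same four values as on the slide; the mapping dictionary is a helper introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample column recreated from the slide.
users = pd.DataFrame({"subscribed": ["y", "n", "n", "y"]})

le = LabelEncoder()
users["sub_enc_le"] = le.fit_transform(users["subscribed"])

# classes_ holds the original labels in sorted order; the position is the code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes.
decoded = [str(label) for label in le.inverse_transform(users["sub_enc_le"])]
print(decoded)
```

Because the labels are sorted, "n" becomes 0 and "y" becomes 1, which matches the output on the slide.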

One-hot encoding

fav_color
blue
green
orange
green

Values: [blue, green, orange]

blue: [1, 0, 0]
green: [0, 1, 0]
orange: [0, 0, 1]

fav_color_enc
[1, 0, 0]
[0, 1, 0]
[0, 0, 1]
[0, 1, 0]
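
The lookup illustrated above can be sketched by hand with NumPy. This only illustrates the idea, assuming the sorted value order shown on the slide; it is not how pandas or scikit-learn implement it.

```python
import numpy as np

fav_color = ["blue", "green", "orange", "green"]

# Distinct values in sorted order: each one gets its own position in the vector.
values = sorted(set(fav_color))

# Row i of the identity matrix is the one-hot vector for values[i].
identity = np.eye(len(values), dtype=int)
index = {value: i for i, value in enumerate(values)}

fav_color_enc = [identity[index[color]].tolist() for color in fav_color]
print(fav_color_enc)
```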

print(users["fav_color"])

0      blue
1     green
2    orange
3     green
Name: fav_color, dtype: object

# dtype=int keeps the 0/1 output shown below; pandas 2.0+ defaults to True/False
print(pd.get_dummies(users["fav_color"], dtype=int))

   blue  green  orange
0     1      0       0
1     0      1       0
2     0      0       1
3     0      1       0
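
In practice the indicator columns usually need to go back into the original DataFrame. A minimal sketch of one way to do that, assuming the slide's data; the prefix and the users_enc name are choices made here for illustration.

```python
import pandas as pd

# Sample data recreated from the slide.
users = pd.DataFrame({
    "user": [1, 2, 3, 4],
    "fav_color": ["blue", "green", "orange", "green"],
})

# prefix labels each new column with its source; dtype=int gives 0/1 values.
dummies = pd.get_dummies(users["fav_color"], prefix="fav_color", dtype=int)

# Attach the indicator columns and drop the original text column.
users_enc = pd.concat([users.drop(columns="fav_color"), dummies], axis=1)
print(users_enc)
```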

Let's practice!
