One-hot encoding

Working with Categorical Data in Python

Kasey Jones

Research Data Scientist

Why not just label encoding?

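# Convert to a categorical dtype so each fuel type gets an integer code
used_cars["engine_fuel"] = used_cars["engine_fuel"].astype("category")
codes = used_cars["engine_fuel"].cat.codes
categories = used_cars["engine_fuel"]
# Map each integer code to its category label
dict(zip(codes, categories))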

{3: 'gasoline',
 2: 'gas',
 0: 'diesel',
 5: 'hybrid-petrol',
 4: 'hybrid-diesel',
 1: 'electric'}
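
The mapping above hints at the limitation: label encoding turns each fuel type into an integer, and a model may read those integers as ordered quantities. The following is a minimal sketch on made-up data (the `fuel` Series is a toy example, not the used_cars dataset) showing that numeric operations run on the codes even though the result has no meaning for fuel types.

```python
import pandas as pd

# Toy data for illustration only; not taken from the used_cars dataset.
fuel = pd.Series(["gasoline", "diesel", "electric", "diesel"], dtype="category")

# Label encoding: the categories are sorted and numbered from 0.
codes = fuel.cat.codes
print(dict(zip(fuel, codes)))

# Arithmetic on the codes runs without error, but an "average fuel type"
# is not a meaningful quantity.
print(codes.mean())
```

One-hot encoding avoids this implied ordering by giving each category its own indicator column.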

One-hot encoding with pandas

pd.get_dummies()

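data: a pandas DataFrame (a Series or array-like also works)
columns: a list-like object of column names to encode; if omitted, all object, string, and category columns are encoded
prefix: a string to prepend to each new column name; for a DataFrame it defaults to the original column name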
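
As a rough illustration of how the three parameters fit together, here is a small sketch. The `cars` frame and the prefix "c" are made up for this example; only the column names mirror the slides.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
cars = pd.DataFrame({
    "odometer_value": [190000, 290000, 10000],
    "color": ["silver", "blue", "blue"],
})

# data=cars; columns=["color"] limits encoding to that one column;
# prefix="c" is joined to each category name with "_".
encoded = pd.get_dummies(cars, columns=["color"], prefix="c")
print(encoded.columns.tolist())
```

Recent pandas versions return boolean indicator columns by default; passing dtype=int gives the 0/1 form shown on these slides.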

One-hot encoding on a DataFrame

used_cars[["odometer_value", "color"]].head()

Example output:

   odometer_value   color
0          190000  silver
1          290000    blue
2          402000     red
3           10000    blue
4          280000   black
...

One-hot encoding on a DataFrame continued

used_cars_onehot = pd.get_dummies(used_cars[["odometer_value", "color"]])

used_cars_onehot.head()

   odometer_value  color_black  color_brown  color_green ...
0          190000            0            0            0 ...
1          290000            0            0            0 ...
2          402000            0            0            0 ...
3           10000            0            0            0 ...
4          280000            1            0            0 ...

print(used_cars_onehot.shape)

(38531, 13)

Specifying columns to use

used_cars_onehot = pd.get_dummies(used_cars, columns=["color"], prefix="")
used_cars_onehot.head()

      manufacturer_name ...  _black  _blue  _brown
0                Subaru ...       0      0       0
1                Subaru ...       0      1       0
2                Subaru ...       0      0       0
3                Subaru ...       0      1       0
4                Subaru ...       1      0       0

print(used_cars_onehot.shape)

(38531, 41)

A few quick notes

One-hot encoding every categorical column might create too many features

used_cars_onehot = pd.get_dummies(used_cars)
print(used_cars_onehot.shape)

(38531, 1240)
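
One way to anticipate this before encoding is to count the unique values in the columns that would be expanded. The sketch below uses a made-up frame (`df` and its values are assumptions for illustration, not the used_cars data).

```python
import pandas as pd

# Toy frame for illustration; not the used_cars data.
df = pd.DataFrame({
    "model": ["Outback", "Forester", "Impreza", "Legacy"],
    "color": ["blue", "blue", "red", "black"],
    "price": [10900.0, 5000.0, 2800.0, 9999.0],
})

# In this toy frame every non-numeric column holds strings, which
# get_dummies encodes by default. Each one becomes one column per
# unique value, while numeric columns pass through unchanged.
categorical = df.select_dtypes(exclude="number")
n_numeric = df.shape[1] - categorical.shape[1]
expected_cols = n_numeric + categorical.nunique().sum()

print(expected_cols, pd.get_dummies(df).shape[1])
```

If the estimate is too large, restricting the encoding with the columns argument is one option.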

NaN values do not get their own column by default (pass dummy_na=True to add one)
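
A short sketch of the missing-value behavior, using a made-up `colors` Series (toy data, not from the used_cars dataset):

```python
import numpy as np
import pandas as pd

# Toy data with one missing value.
colors = pd.Series(["blue", np.nan, "red"])

# Default: the row with NaN has no indicator set in any column.
default = pd.get_dummies(colors)

# dummy_na=True adds a separate indicator column for missing values.
with_na = pd.get_dummies(colors, dummy_na=True)

print(default.shape, with_na.shape)
```

With the default settings, a row whose value is missing is indistinguishable from a row that simply belongs to none of the encoded categories.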

One-hot encoding practice
