One-hot encoding

Working with Categorical Data in Python

Kasey Jones

Research Data Scientist

Why not just label encoding?

used_cars["engine_fuel"] = used_cars["engine_fuel"].astype("category")
codes = used_cars["engine_fuel"].cat.codes
categories = used_cars["engine_fuel"]
dict(zip(codes, categories))
{3: 'gasoline',
 2: 'gas',
 0: 'diesel',
 5: 'hybrid-petrol',
 4: 'hybrid-diesel',
 1: 'electric'}
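
The mapping above is alphabetical, so the integer codes carry an order and a spacing that fuel types do not have. A minimal sketch of that problem follows; the toy Series is a made-up stand-in for used_cars["engine_fuel"], not the course data.

```python
import pandas as pd

# Hypothetical toy data standing in for used_cars["engine_fuel"]
fuel = pd.Series(["gasoline", "diesel", "electric", "gasoline"], dtype="category")

# Categories are sorted alphabetically: 0=diesel, 1=electric, 2=gasoline
code_map = dict(enumerate(fuel.cat.categories))
codes = fuel.cat.codes

# Arithmetic on the codes runs without error but means nothing for fuel types
mean_code = codes.mean()
print(code_map, mean_code)
```

A model fed these codes could treat gasoline as "twice" electric, which is one common motivation for one-hot encoding instead.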

One-hot encoding with pandas

pd.get_dummies()

  • data: a pandas DataFrame
  • columns: a list-like object of column names
  • prefix: a string prepended to each new column name (defaults to the original column name)
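
As a rough illustration of the three parameters listed above, the sketch below encodes one column of a small made-up frame. The frame `cars` and the prefix "paint" are assumptions chosen for the example, not part of the course data.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration
cars = pd.DataFrame({
    "odometer_value": [190000, 290000, 10000],
    "color": ["silver", "blue", "blue"],
})

# Encode only "color"; new columns are named "<prefix>_<category>"
onehot = pd.get_dummies(data=cars, columns=["color"], prefix="paint")
print(list(onehot.columns))
```

Columns that are not encoded keep their place at the front, and the indicator columns follow in sorted category order. Depending on the pandas version, the indicators may print as 0/1 or as False/True.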

One-hot encoding on a DataFrame

used_cars[["odometer_value", "color"]].head()

Example output:

   odometer_value   color
0          190000  silver
1          290000    blue
2          402000     red
3           10000    blue
4          280000   black
...

One-hot encoding on a DataFrame continued

used_cars_onehot = pd.get_dummies(used_cars[["odometer_value", "color"]])

used_cars_onehot.head()
   odometer_value  color_black  color_brown  color_green ...
0          190000            0            0            0 ...
1          290000            0            0            0 ...
2          402000            0            0            0 ...
3           10000            0            0            0 ...
4          280000            1            0            0 ...
print(used_cars_onehot.shape)
(38531, 13)

Specifying columns to use

used_cars_onehot = pd.get_dummies(used_cars, columns=["color"], prefix="")
used_cars_onehot.head()
      manufacturer_name ...  _black  _blue  _brown
0                Subaru ...       0      0       0
1                Subaru ...       0      1       0
2                Subaru ...       0      0       0
3                Subaru ...       0      1       0
4                Subaru ...       1      0       0
print(used_cars_onehot.shape)
(38531, 41)

A few quick notes

  • Might create too many features
used_cars_onehot = pd.get_dummies(used_cars)
print(used_cars_onehot.shape)
(38531, 1240)
  • NaN values do not get their own column by default (pass dummy_na=True to add one)
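
Both notes can be checked on a small example. The sketch below uses a made-up frame (the column names echo the slides, the values are hypothetical) to count categories before encoding and to compare the default NaN handling with dummy_na=True.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names mirror the slides, values are made up
cars = pd.DataFrame({
    "model_name": ["Outback", "Forester", "Impreza", "Legacy"],
    "color": ["red", np.nan, "blue", "red"],
})

# Distinct categories per column: a quick check for columns that would explode in width
n_categories = cars.nunique()

# Default: the row with a missing color gets zeros in every color column
default = pd.get_dummies(cars["color"])

# dummy_na=True adds one extra indicator column for missing values
with_na = pd.get_dummies(cars["color"], dummy_na=True)

print(n_categories)
print(default.shape, with_na.shape)
```

Checking nunique() first is one way to decide which columns to pass in `columns` rather than encoding the whole frame.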

One-hot encoding practice
