Working with Categorical Data in Python
Kasey Jones
Research Data Scientist
used_cars["engine_fuel"] = used_cars["engine_fuel"].astype("category")
codes = used_cars["engine_fuel"].cat.codes
categories = used_cars["engine_fuel"]
dict(zip(codes, categories))
{3: 'gasoline',
2: 'gas',
0: 'diesel',
5: 'hybrid-petrol',
4: 'hybrid-diesel',
1: 'electric'}
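The mapping above pairs every row's code with its label, so it loops over the whole column. A minimal sketch of a shorter route, assuming a small made-up DataFrame in place of the real used_cars data: each label's position in `.cat.categories` is its code, so enumerating the categories gives the same dictionary.

```python
import pandas as pd

# Hypothetical toy data standing in for the used_cars dataset.
used_cars = pd.DataFrame(
    {"engine_fuel": ["gasoline", "diesel", "gas", "gasoline", "electric"]}
)
used_cars["engine_fuel"] = used_cars["engine_fuel"].astype("category")

# Row-by-row approach from the slide: pair each row's code with its label.
codes = used_cars["engine_fuel"].cat.codes
categories = used_cars["engine_fuel"]
row_mapping = dict(zip(codes, categories))

# Alternative: a label's position in .cat.categories is its integer code.
direct_mapping = dict(enumerate(used_cars["engine_fuel"].cat.categories))

print(direct_mapping)
```

Both dictionaries hold the same code-to-label pairs; the second is built from the handful of categories rather than from every row.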
pd.get_dummies()
data: a pandas DataFrame
columns: a list-like object of column names
prefix: a string to add to the beginning of each category
used_cars[["odometer_value", "color"]].head()
Example output:
odometer_value color
0 190000 silver
1 290000 blue
2 402000 red
3 10000 blue
4 280000 black
...
used_cars_onehot = pd.get_dummies(used_cars[["odometer_value", "color"]])
used_cars_onehot.head()
odometer_value color_black color_brown color_green ...
0 190000 0 0 0 ...
1 290000 0 0 0 ...
2 402000 0 0 0 ...
3 10000 0 0 0 ...
4 280000 1 0 0 ...
print(used_cars_onehot.shape)
(38531, 13)
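The 13 columns reported above are the numeric odometer_value column plus 12 indicator columns for color. A small sketch of the same arithmetic, assuming a made-up five-row DataFrame called `toy` (not the real dataset). Note that pandas 2.0 and later return True/False indicator columns by default; `dtype=int` asks for the 0/1 values shown in the slides.

```python
import pandas as pd

# Hypothetical five-row sample; values mirror the head() output above.
toy = pd.DataFrame(
    {
        "odometer_value": [190000, 290000, 402000, 10000, 280000],
        "color": ["silver", "blue", "red", "blue", "black"],
    }
)

# dtype=int requests 0/1 indicators instead of True/False.
toy_onehot = pd.get_dummies(toy, dtype=int)

# One untouched numeric column plus one indicator per distinct color.
expected_columns = 1 + toy["color"].nunique()

print(toy_onehot.shape)
print(list(toy_onehot.columns))
```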
used_cars_onehot = pd.get_dummies(used_cars, columns=["color"], prefix="")
used_cars_onehot.head()
manufacturer_name ... _black _blue _brown
0 Subaru ... 0 0 0
1 Subaru ... 0 1 0
2 Subaru ... 0 0 0
3 Subaru ... 0 1 0
4 Subaru ... 1 0 0
print(used_cars_onehot.shape)
(38531, 41)
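With prefix="" the new columns still start with an underscore, because the default separator is kept. A brief sketch of how `columns`, `prefix`, and `prefix_sep` interact, assuming a made-up three-row DataFrame called `toy`:

```python
import pandas as pd

# Hypothetical sample with one column to encode and one to leave alone.
toy = pd.DataFrame(
    {
        "manufacturer_name": ["Subaru", "Subaru", "Subaru"],
        "color": ["silver", "blue", "black"],
    }
)

# Only "color" is encoded; "manufacturer_name" passes through unchanged.
with_underscore = pd.get_dummies(toy, columns=["color"], prefix="")

# Setting prefix_sep="" as well drops the leading underscore.
no_prefix = pd.get_dummies(toy, columns=["color"], prefix="", prefix_sep="")

print(list(with_underscore.columns))
print(list(no_prefix.columns))
```

Bare category names are only safe when they cannot collide with existing column names, which is one reason to keep a prefix.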
used_cars_onehot = pd.get_dummies(used_cars)
print(used_cars_onehot.shape)
(38531, 1240)
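Encoding every column can create far more features than expected, as the 1240 columns above show. One way to see this coming, sketched here on a made-up DataFrame called `toy` with a hand-picked list of categorical columns (both are assumptions for illustration), is to add up the number of distinct values per column before calling pd.get_dummies().

```python
import pandas as pd

# Hypothetical sample; the real dataset has many more columns and rows.
toy = pd.DataFrame(
    {
        "manufacturer_name": ["Subaru", "Subaru", "Audi", "Ford"],
        "color": ["silver", "blue", "red", "blue"],
        "odometer_value": [190000, 290000, 402000, 10000],
    }
)
categorical_cols = ["manufacturer_name", "color"]

# Columns left alone, plus one indicator per distinct value in each encoded column.
untouched = toy.shape[1] - len(categorical_cols)
predicted_width = untouched + int(toy[categorical_cols].nunique().sum())

actual_width = pd.get_dummies(toy, columns=categorical_cols).shape[1]

print(predicted_width, actual_width)
```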
NaN values do not get their own column
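A short sketch of the NaN behaviour, assuming a made-up three-value Series called `colors`: by default the missing row gets 0 in every indicator column, and the `dummy_na=True` argument of pd.get_dummies() adds a separate indicator for missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
colors = pd.Series(["silver", np.nan, "blue"], name="color")

# Default: no column for NaN, so the missing row is all zeros.
default = pd.get_dummies(colors, dtype=int)

# dummy_na=True appends an extra indicator column for missing values.
with_na = pd.get_dummies(colors, dummy_na=True, dtype=int)

print(default)
print(with_na)
```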