Working with categorical data in Python
Kasey Jones
Research Data Scientist
import pandas as pd
used_cars = pd.read_csv("used_cars.csv")
used_cars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38531 entries, 0 to 38530
Data columns (total 30 columns):
# Column Non-Null Count Dtype
-- ------ -------------- -----
0 manufacturer_name 38531 non-null object
1 model_name 38531 non-null object
2 transmission 38531 non-null object
...
used_cars['manufacturer_name'].describe()
count 38531
unique 55
top Volkswagen
freq 4243
Name: manufacturer_name, dtype: object
print("As object: ", used_cars['manufacturer_name'].nbytes)
print("As category: ", used_cars['manufacturer_name'].astype('category').nbytes)
As object: 308248
As category: 38971
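The comparison above can be reproduced on a small synthetic column. This is a minimal sketch: the `makes` Series is a hypothetical stand-in for `manufacturer_name`, not the course data. Note that `.nbytes` on an object column counts only the 8-byte pointers, so `memory_usage(deep=True)` is included to account for the strings themselves.

```python
import pandas as pd

# Hypothetical stand-in for a low-cardinality text column (not the real used_cars data)
makes = pd.Series(["Volkswagen", "Ford", "Opel", "Audi"] * 2500, name="manufacturer_name")

as_object = makes.astype("object")
as_category = makes.astype("category")

# .nbytes: pointers only for object, codes + categories for category
object_bytes = as_object.nbytes
category_bytes = as_category.nbytes

# deep=True also measures the Python string objects behind the pointers
deep_object_bytes = as_object.memory_usage(deep=True)
deep_category_bytes = as_category.memory_usage(deep=True)

print("As object:   ", object_bytes, "| deep:", deep_object_bytes)
print("As category: ", category_bytes, "| deep:", deep_category_bytes)
```

The fewer distinct values a column has relative to its length, the larger the saving tends to be.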
used_cars['odometer_value'].astype('object').describe()
count 38531
unique 6063
top 300000
freq 1794
Name: odometer_value, dtype: object
print(f"As float: {used_cars['odometer_value'].nbytes}")
print(f"As category: {used_cars['odometer_value'].astype('category').nbytes}")
As float: 308248
As category: 125566
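The two category sizes shown so far can be checked by hand. The sketch below assumes a categorical stores one integer code per row (1 byte when there are at most 127 categories, 2 bytes up to 32,767) plus 8 bytes per distinct category value; `category_nbytes` is a hypothetical helper written for this illustration, not a pandas function.

```python
N_ROWS = 38531  # row count reported by used_cars.info()

def category_nbytes(n_rows, n_unique, code_bytes, value_bytes=8):
    """Approximate categorical size: per-row codes plus one stored value per category."""
    return n_rows * code_bytes + n_unique * value_bytes

# manufacturer_name: 55 unique values fit in int8 codes
manufacturer_bytes = category_nbytes(N_ROWS, 55, code_bytes=1)

# odometer_value: 6063 unique values need int16 codes
odometer_bytes = category_nbytes(N_ROWS, 6063, code_bytes=2)

print(manufacturer_bytes)  # 38971
print(odometer_bytes)      # 125566
```

Under these assumptions, a column with many unique values saves less: the codes get wider and the list of category values grows.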
Using the .str accessor to manipulate data converts the Series to an object.
.apply() returns a new Series as an object.
Check
used_cars["color"] = used_cars["color"].astype("category")
used_cars["color"] = used_cars["color"].str.upper()
print(used_cars["color"].dtype)
object
Convert
used_cars["color"] = used_cars["color"].astype("category")
print(used_cars["color"].dtype)
category
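One way to avoid the convert-back step is to edit the category labels instead of the values. The sketch below uses a small hypothetical `colors` Series and `Series.cat.rename_categories` with a callable; it is an alternative to the `.str` approach above, not the method used in the slides.

```python
import pandas as pd

# Hypothetical sample column, already categorical
colors = pd.Series(["black", "silver", "blue", "black"], dtype="category")

# .str works on the values and returns a non-categorical Series
via_str = colors.str.upper()

# rename_categories applies the function to each category label and keeps the dtype
via_cat = colors.cat.rename_categories(str.upper)

print(via_str.dtype)
print(via_cat.dtype)
```

Because only the category labels are touched, this also does less work than transforming every row.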
Set categories
used_cars["color"] = used_cars["color"].astype("category")
used_cars["color"] = used_cars["color"].cat.set_categories(["black", "silver", "blue"])
used_cars["color"].value_counts(dropna=False)
NaN 18172
black 7705
silver 6852
blue 5802
Name: color, dtype: int64
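The NaN row in the counts above comes from values that are not in the new category list. A minimal sketch on a hypothetical five-row `colors` Series shows the effect:

```python
import pandas as pd

# Hypothetical sample: "red" is not in the category list set below
colors = pd.Series(["black", "red", "blue", "silver", "red"], dtype="category")

# Values outside the listed categories become NaN
limited = colors.cat.set_categories(["black", "silver", "blue"])

n_missing = int(limited.isna().sum())
print(limited.value_counts(dropna=False))
print("Missing after set_categories:", n_missing)
```

Checking `value_counts(dropna=False)` after setting categories is a quick way to see how many rows were dropped to NaN.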
used_cars['number_of_photos'] = used_cars['number_of_photos'].astype("category")
used_cars['number_of_photos'].sum() # <--- Raises an error
TypeError: Categorical cannot perform the operation sum
used_cars['number_of_photos'].astype(int).sum()
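The same behavior can be seen on a tiny example. This sketch uses a hypothetical `photos` Series; it catches the `TypeError` and then converts to integers before summing, as in the fix above.

```python
import pandas as pd

# Hypothetical numeric column stored as a categorical
photos = pd.Series([3, 9, 3, 12], dtype="category")

try:
    photos.sum()
    raised = False
except TypeError:
    # Categoricals do not support numeric reductions such as sum
    raised = True

# Convert back to a numeric dtype before doing arithmetic
total = photos.astype(int).sum()

print("sum() raised TypeError:", raised)
print("Total after astype(int):", total)
```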
Note:
# .str converts the column to an array
used_cars["color"].str.contains("red")
0 False
1 False
...