Working with categorical data in Python
Kasey Jones
Research Data Scientist
Counts per breed:
dogs["breed"] = dogs["breed"].astype("category")
dogs["breed"].value_counts()
Unknown Mix 1524
German Shepherd Dog Mix 190
Dachshund Mix 147
Labrador Retriever Mix 83
Staffordshire Terrier Mix 62
...
The rename_categories method:
Series.cat.rename_categories(new_categories=dict)
Create a dictionary mapping the old category name to the new one:
my_changes = {"Unknown Mix": "Unknown"}
Rename the category:
dogs["breed"] = dogs["breed"].cat.rename_categories(my_changes)
Counts per breed after renaming:
dogs["breed"].value_counts()
Unknown 1524
German Shepherd Dog Mix 190
Dachshund Mix 147
Labrador Retriever Mix 83
Staffordshire Terrier Mix 62
...
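The rename above can be reproduced on a few rows. This is a minimal sketch: the four-row `dogs` frame is hypothetical sample data standing in for the real dataset, and only the `breed` column is modeled.

```python
import pandas as pd

# Hypothetical sample data standing in for the dogs dataset
dogs = pd.DataFrame(
    {"breed": ["Unknown Mix", "Dachshund Mix", "Unknown Mix", "Labrador Retriever Mix"]}
)
dogs["breed"] = dogs["breed"].astype("category")

# Keys are existing category names, values are the replacement names
my_changes = {"Unknown Mix": "Unknown"}
dogs["breed"] = dogs["breed"].cat.rename_categories(my_changes)

# The category label changes; the rows that carried it are unaffected
breed_categories = list(dogs["breed"].cat.categories)
unknown_count = int((dogs["breed"] == "Unknown").sum())
print(breed_categories)
print(unknown_count)
```

Only the label is rewritten, so the count for "Unknown" matches the earlier count for "Unknown Mix".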
Multiple changes at once:
my_changes = {
old_name1: new_name1,
old_name2: new_name2,
...
}
Series.cat.rename_categories(
my_changes
)
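As a concrete instance of the template above, the sketch below renames two categories in one call. The breed values are hypothetical sample data; categories not listed in the dictionary are left as they are.

```python
import pandas as pd

# Hypothetical sample values for illustration
breed = pd.Series(
    ["Unknown Mix", "Dachshund Mix", "Labrador Retriever Mix"], dtype="category"
)

# Two renames in a single dictionary
my_changes = {
    "Unknown Mix": "Unknown",
    "Dachshund Mix": "Dachshund",
}
breed = breed.cat.rename_categories(my_changes)

renamed = list(breed.cat.categories)
print(renamed)
```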
Updating multiple categories with a function:
dogs['sex'] = dogs['sex'].cat.rename_categories(lambda c: c.title())
dogs['sex'].cat.categories
Index(['Female', 'Male'], dtype='object')
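A small, self-contained sketch of the same idea: `rename_categories` also accepts a callable, which is applied to each category name. The lowercase `sex` values here are assumed sample data.

```python
import pandas as pd

# Assumed sample data: lowercase labels that we want title-cased
sex = pd.Series(["female", "male", "male", "female"], dtype="category")

# The callable receives each category name and returns the new name
sex = sex.cat.rename_categories(lambda c: c.title())

sex_categories = list(sex.cat.categories)
print(sex_categories)
```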
# Does not work if "Unknown" already exists as a category
use_new_categories = {"Unknown Mix": "Unknown"}
# Does not work! New names must be unique
cannot_repeat_categories = {
"Unknown Mix": "Unknown",
"Mixed Breed": "Unknown"
}
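Both restrictions can be checked on toy data. In the sketch below, the two Series and the helper `try_rename` are hypothetical and exist only for illustration; the helper catches the `ValueError` that pandas raises when the resulting category names would not be unique.

```python
import pandas as pd

def try_rename(series, changes):
    """Return the renamed Series, or an error string if pandas rejects the rename."""
    try:
        return series.cat.rename_categories(changes)
    except ValueError as err:
        return f"ValueError: {err}"

# Case 1: the new name "Unknown" is already a category
has_unknown = pd.Series(["Unknown Mix", "Unknown"], dtype="category")
result_existing = try_rename(has_unknown, {"Unknown Mix": "Unknown"})

# Case 2: two old names are mapped to the same new name
two_mixes = pd.Series(["Unknown Mix", "Mixed Breed"], dtype="category")
result_repeat = try_rename(
    two_mixes, {"Unknown Mix": "Unknown", "Mixed Breed": "Unknown"}
)

print(result_existing)
print(result_repeat)
```

Collapsing several categories into one therefore needs a different tool than `rename_categories`.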
A dog's color:
dogs["color"] = dogs["color"].astype("category")
print(dogs["color"].cat.categories)
Index(['apricot', 'black', 'black and brown', 'black and tan',
'black and white', 'brown', 'brown and white', 'dotted', 'golden',
'gray', 'gray and black', 'gray and white', 'red', 'red and white',
'sable', 'saddle back', 'spotty', 'striped', 'tricolor', 'white',
'wild boar', 'yellow', 'yellow-brown'],
dtype='object')
...
Create a dictionary and use .replace to collapse categories:
update_colors = {
"black and brown": "black",
"black and tan": "black",
"black and white": "black",
}
dogs["main_color"] = dogs["color"].replace(update_colors)
Check the dtype of the Series (it is no longer categorical, so convert it back):
dogs["main_color"].dtype
dtype('O')
dogs["main_color"] = dogs["main_color"].astype("category")
dogs["main_color"].cat.categories
Index(['apricot', 'black', 'brown', 'brown and white', 'dotted', 'golden',
'gray', 'gray and black', 'gray and white', 'red', 'red and white',
'sable', 'saddle back', 'spotty', 'striped', 'tricolor', 'white',
'wild boar', 'yellow', 'yellow-brown'],
dtype='object')
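The collapse-then-convert pattern can be sketched on a handful of values. The `color` Series below is hypothetical sample data. One assumption to flag: the sketch casts to `object` before calling `.replace`, because the result of `.replace` on a categorical column has differed between pandas versions; casting first is a variant chosen here, not the slide's exact code.

```python
import pandas as pd

# Hypothetical sample values for illustration
color = pd.Series(
    ["black and brown", "black", "black and tan", "white", "black and white"],
    dtype="category",
)

update_colors = {
    "black and brown": "black",
    "black and tan": "black",
    "black and white": "black",
}

# Assumption: cast to object first so .replace behaves the same across versions,
# then convert the collapsed values back to a categorical
main_color = color.astype(object).replace(update_colors).astype("category")

main_categories = list(main_color.cat.categories)
black_count = int((main_color == "black").sum())
print(main_categories)
print(black_count)
```

Unlike `rename_categories`, this maps several old values onto one new value, so the number of categories shrinks.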