Working with Categorical Data in Python
Kasey Jones
Research Data Scientist
1) Inconsistent values: "Ham", "ham", " Ham"
2) Misspelled values: "Ham", "Hma"
3) Wrong dtype: df['Our Column'].dtype returns dtype('O')
To identify these issues, use either:
Series.cat.categories
Series.value_counts()
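As a minimal sketch of both checks, assume a small hypothetical Series named food (toy data, not part of the dogs dataset). Listing the categories surfaces inconsistent and misspelled labels, and the frequency counts show how common each variant is.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the two checks.
food = pd.Series(["Ham", "ham", " Ham", "Hma", "Ham"], dtype="category")

# The distinct categories expose inconsistent and misspelled labels.
categories = list(food.cat.categories)
print(categories)

# Frequency counts show how often each variant occurs.
counts = food.value_counts()
print(counts)
```

Note that .cat.categories is only available when the Series already has a category dtype, while value_counts() works on any Series.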
dogs["get_along_cats"].value_counts()
No 2503
yes 275
no 156
Noo 2
NO 1
Removing whitespace: .strip()
dogs["get_along_cats"] = dogs["get_along_cats"].str.strip()
Check the frequency counts:
dogs["get_along_cats"].value_counts()
No 2503
yes 275
no 156
Noo 2
NO 1 # <---- no more whitespace
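The effect of .str.strip() is easier to see on a tiny example. The sketch below assumes a hypothetical Series named answers in which two entries carry stray spaces. Stripping collapses them into a single label.

```python
import pandas as pd

# Hypothetical responses; two entries carry leading or trailing spaces.
answers = pd.Series(["No", " No", "No ", "yes"])

# .str.strip() removes leading and trailing whitespace only.
stripped = answers.str.strip()

print(answers.nunique())   # number of distinct labels before stripping
print(stripped.nunique())  # number of distinct labels after stripping
```

Capitalization differences survive this step, so "No" and "yes" still need a separate case fix.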
Capitalization: .title(), .upper(), .lower()
dogs["get_along_cats"] = dogs["get_along_cats"].str.title()
Check the frequency counts:
dogs["get_along_cats"].value_counts()
No 2660
Yes 275
Noo 2
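For comparison, here is a small sketch of the three capitalization methods applied to the same hypothetical Series (answers is toy data, not the dogs dataset). Any of the three makes the labels consistent; the choice only decides what the final labels look like.

```python
import pandas as pd

# Hypothetical responses that differ only in capitalization.
answers = pd.Series(["No", "yes", "no", "NO"])

titled = answers.str.title()   # first letter of each word upper case
uppered = answers.str.upper()  # all upper case
lowered = answers.str.lower()  # all lower case

# After any of the three, only two distinct labels remain.
print(titled.value_counts())
```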
Fixing a typo with .replace()
replace_map = {"Noo": "No"}
dogs["get_along_cats"] = dogs["get_along_cats"].replace(replace_map)
Check the frequency counts:
dogs["get_along_cats"].value_counts()
No 2662
Yes 275
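A short sketch of the same typo fix on hypothetical data (the Series name answers is an assumption for illustration). Values that do not appear as keys in the map pass through unchanged, so the map only needs to list the misspellings.

```python
import pandas as pd

# Hypothetical responses containing one misspelled label.
answers = pd.Series(["No", "Yes", "Noo", "No"])

# Map each misspelling to its corrected label.
replace_map = {"Noo": "No"}

# Assign the result back; values missing from the map are left untouched.
fixed = answers.replace(replace_map)
print(fixed.value_counts())
```

Assigning the result to a new or existing name avoids relying on inplace=True against a selected column, which may not update the parent DataFrame in recent pandas versions.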
Checking the dtype
dogs["get_along_cats"].dtype
dtype('O')
Converting back to a category
dogs["get_along_cats"] = dogs["get_along_cats"].astype("category")
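The dtype check matters because .str methods return a non-categorical Series even when the input is categorical. The sketch below shows the round trip on a hypothetical Series named answers.

```python
import pandas as pd

# Hypothetical categorical column with one label carrying whitespace.
answers = pd.Series(["No", " No", "Yes"], dtype="category")

# String methods return a plain string/object Series, not a categorical one.
cleaned = answers.str.strip()
print(cleaned.dtype)

# Convert back once the cleaning is finished.
recast = cleaned.astype("category")
print(recast.cat.categories)
```

Converting once at the end, rather than after every cleaning step, keeps the category list free of stale labels such as " No".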
Searching for a string
dogs["breed"].str.contains("Shepherd", regex=False)
0 False
1 False
2 False
...
2935 False
2936 True
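The Boolean Series returned by .str.contains() is typically used as a row filter. A minimal sketch, assuming a hypothetical DataFrame named dogs_demo with made-up breed values; the na=False argument is an added assumption so that missing breeds count as non-matches.

```python
import pandas as pd

# Hypothetical breed column, including one missing value.
dogs_demo = pd.DataFrame(
    {"breed": ["German Shepherd Dog", "Poodle", "Shepherd Mix", None]}
)

# regex=False treats the pattern as a literal substring;
# na=False makes missing values evaluate to False instead of NaN.
mask = dogs_demo["breed"].str.contains("Shepherd", regex=False, na=False)

# Keep only the rows whose breed contains the substring.
shepherds = dogs_demo[mask]
print(shepherds)
```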
Access Series values based on category
dogs.loc[dogs["get_along_cats"] == "Yes", "size"]
Series value counts:
dogs.loc[dogs["get_along_cats"] == "Yes", "size"].value_counts(sort=False)
small 69
medium 169
large 37
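The same pattern on a small hypothetical DataFrame (dogs_demo and its values are made up for illustration): .loc takes a Boolean row condition and a column label, and value_counts() then summarizes the selected column.

```python
import pandas as pd

# Hypothetical data with the two columns used above.
dogs_demo = pd.DataFrame({
    "get_along_cats": ["Yes", "No", "Yes", "Yes", "No"],
    "size": ["small", "large", "medium", "medium", "small"],
})

# Rows where the condition holds, restricted to the "size" column.
sizes = dogs_demo.loc[dogs_demo["get_along_cats"] == "Yes", "size"]

# sort=False skips ordering the result by frequency.
size_counts = sizes.value_counts(sort=False)
print(size_counts)
```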