Working with Categorical Data in Python
Kasey Jones
Research Data Scientist
dogs.info()
RangeIndex: 2937 entries, 0 to 2936
Data columns (total 19 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   ID              2937 non-null   int64
...
 8   color           2937 non-null   object
 9   coat            2937 non-null   object
...
 17  get_along_cats  431 non-null    object
 18  keep_in         1916 non-null   object
dtypes: float64(1), int64(1), object(17)
memory usage: 436.1+ KB
...
dogs["coat"] = dogs["coat"].astype("category")
dogs["coat"].value_counts(dropna=False)
short 1972
medium 565
wirehaired 220
long 180
Name: coat, dtype: int64
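To make the conversion step concrete, here is a minimal sketch on a small made-up Series. The data is a hypothetical stand-in for dogs["coat"], not the real dataset. It converts an object column to the category dtype and compares memory use before and after.

```python
import pandas as pd

# Hypothetical stand-in for dogs["coat"]; these values are made up for illustration.
coat = pd.Series(["short", "medium", "short", "long", "wirehaired"] * 200)

# Convert the column to the category dtype.
coat_cat = coat.astype("category")

# deep=True counts the memory held by the string values themselves.
bytes_before = coat.memory_usage(deep=True)
bytes_after = coat_cat.memory_usage(deep=True)

print(coat_cat.dtype)
print(f"before: {bytes_before} bytes, after: {bytes_after} bytes")
print(coat_cat.value_counts(dropna=False))
```

With only a few distinct values repeated many times, the categorical version should come out noticeably smaller, since each row stores a small integer code rather than a full string.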
Series.cat.method_name
Common parameters:
new_categories: a list of categories
inplace: Boolean - whether or not the update should overwrite the Series (deprecated in pandas 1.3 and removed in pandas 2.0; assign the result back to the Series instead)
ordered: Boolean - whether or not the categorical is treated as an ordered categorical
Set categories:
dogs["coat"] = dogs["coat"].cat.set_categories(
    new_categories=["short", "medium", "long"]
)
Check value counts:
dogs["coat"].value_counts(dropna=False)
short 1972
medium 565
NaN 220
long 180
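The value counts above show the 220 "wirehaired" rows turning into NaN, because that value was left out of the new category list. A small sketch of the same behavior, using made-up data rather than the dogs dataset:

```python
import pandas as pd

# Hypothetical five-row example; not the real dogs data.
coat = pd.Series(
    ["short", "medium", "wirehaired", "long", "short"], dtype="category"
)

# "wirehaired" is not in the new list, so that row becomes NaN.
coat = coat.cat.set_categories(new_categories=["short", "medium", "long"])

n_missing = int(coat.isna().sum())
print(coat.cat.categories.tolist())
print(n_missing)
```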
dogs["coat"] = dogs["coat"].cat.set_categories(
    new_categories=["short", "medium", "long"],
    ordered=True
)
dogs["coat"].head(3)
0 short
1 short
2 short
Name: coat, dtype: category
Categories (3, object): ['short' < 'medium' < 'long']
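One reason to set ordered=True is that comparisons, sorting, and min/max then follow the category order rather than alphabetical order. A minimal sketch on made-up data (the Series below is hypothetical):

```python
import pandas as pd

# Hypothetical example data.
coat = pd.Series(["short", "long", "medium", "short"], dtype="category")
coat = coat.cat.set_categories(
    new_categories=["short", "medium", "long"], ordered=True
)

# Comparisons use the declared order: short < medium < long.
longer_than_short = coat > "short"

# max() and sort_values() also respect the declared order.
longest = coat.max()
sorted_coat = coat.sort_values()

print(longer_than_short.tolist())
print(longest)
print(sorted_coat.tolist())
```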
dogs["likes_people"].value_counts(dropna=False)
yes 1991
NaN 938
no 8
A NaN could mean that the value was never checked, or that it could not be determined.
Add categories
dogs["likes_people"] = dogs["likes_people"].astype("category")
dogs["likes_people"] = dogs["likes_people"].cat.add_categories(
    new_categories=["did not check", "could not tell"]
)
Check categories:
dogs["likes_people"].cat.categories
Index(['no', 'yes', 'did not check', 'could not tell'], dtype='object')
dogs["likes_people"].value_counts(dropna=False)
yes 1991
NaN 938
no 8
could not tell 0
did not check 0
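The new categories start with a count of 0 because adding a category does not change any values. As a hedged follow-up that the slides do not show, one way to use a new category is to fill missing values with it. The choice of "did not check" for every NaN below is purely an assumption for illustration, and the data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dogs["likes_people"].
likes_people = pd.Series(["yes", np.nan, "no", "yes", np.nan], dtype="category")

# Adding categories leaves every existing value (including NaN) untouched.
likes_people = likes_people.cat.add_categories(
    new_categories=["did not check", "could not tell"]
)

# Assumption for illustration only: treat every NaN as "did not check".
# This works because "did not check" is now a valid category.
filled = likes_people.fillna("did not check")

print(likes_people.cat.categories.tolist())
print(filled.value_counts(dropna=False))
```

Filling with a value that is not already a category would raise an error, which is why the category is added first.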
dogs["coat"] = dogs["coat"].astype("category")
dogs["coat"] = dogs["coat"].cat.remove_categories(removals=["wirehaired"])
Check the categories:
dogs["coat"].cat.categories
Index(['long', 'medium', 'short'], dtype='object')
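As with set_categories, rows that held a removed category become NaN. The sketch below shows this on made-up data, and also shows that asking to remove a category that does not exist raises a ValueError. The "curly" category is a hypothetical name chosen only to trigger that error.

```python
import pandas as pd

# Hypothetical example data.
coat = pd.Series(["short", "wirehaired", "long", "medium"], dtype="category")

# Rows holding a removed category become NaN.
removed = coat.cat.remove_categories(removals=["wirehaired"])
n_missing = int(removed.isna().sum())

# Removing a category that is not present raises ValueError.
error_message = None
try:
    coat.cat.remove_categories(removals=["curly"])
except ValueError as err:
    error_message = str(err)

print(removed.cat.categories.tolist())
print(n_missing)
print(error_message)
```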
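cat.set_categories(): sets the categories (and optionally their order); values not in the list become NaN
cat.add_categories(): adds new categories without changing any existing values
cat.remove_categories(): removes the listed categories; values in those categories become NaN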