Using Categorical and Enum dtypes

Scaling and Optimizing Data Pipelines with Polars

Liam Brannigan

Data Scientist & Polars Contributor

Encoding repeated strings

Diagram of a table with an event title column and an area column with repeated strings.

Diagram of a table with an event title column and an area column where the strings are encoded by integers.
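
The encoding in the diagram can be sketched in plain Python. This is only an illustration of the idea (store each distinct string once and keep one small integer per row), not a description of how Polars implements it internally; the sample values and variable names are assumptions made for the example.

```python
# Dictionary-encode a column of repeated strings (illustrative sketch only).
areas = ["Loop", "West Loop", "Loop", "Pullman", "West Loop", "Loop"]

categories = []  # each distinct string is stored once
codes = []       # one small integer per row
index = {}       # string -> integer code, for fast lookup

for area in areas:
    if area not in index:
        index[area] = len(categories)
        categories.append(area)
    codes.append(index[area])

# Decoding maps each integer code back to its string.
decoded = [categories[code] for code in codes]

print(categories)  # ['Loop', 'West Loop', 'Pullman']
print(codes)       # [0, 1, 0, 2, 1, 0]
```

The longer and more repetitive the strings, the more a column of integers plus a short lookup list saves compared with storing every string in full.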

A dataset with repeated labels

import polars as pl

events = pl.read_parquet("chicago_events.parquet")

events["area"].value_counts(sort=True).head(5)
shape: (5, 2)
| area          | count |
| ---           | ---   |
| str           | u32   |
|---------------|-------|
| Loop          | 35982 |
| West Loop     | 27432 |
| Andersonville | 24660 |
| Downtown      | 24320 |
| Pullman       | 21985 |

Creating a Categorical column

events_cat = events.with_columns(
    pl.col("area").cast(pl.Categorical)
)

events_cat.select("event_title","area").head()
shape: (5, 2)
| event_title        | area          |
| ---                | ---           |
| str                | cat           |
|--------------------|---------------|
| Folk Festival      | Andersonville |
| Fireworks Night    | Downtown      |
| Greektown Market   | West Side     |
| Rail History Day   | Pullman       |
| Grant Park Concert | Loop          |

events_cat.select("event_title","area").with_columns(
    pl.col("area").to_physical().alias("physical")
).head(3)
shape: (3, 3)
| event_title        | area          | physical |
| ---                | ---           | ---      |
| str                | cat           | u32      |
|--------------------|---------------|----------|
| Folk Festival      | Andersonville | 0        |
| Fireworks Night    | Downtown      | 1        |
| Greektown Market   | West Side     | 2        |

Using a categorical expression

events_cat.select("event_title","area",
    pl.col("area").cat.starts_with("West").alias("westside"),
)

events_cat.select("event_title","area",
    pl.col("area").cat.starts_with("West").alias("westside"),
).filter(pl.col("westside")).head(2)
shape: (2, 3)
| event_title             | area      | westside |
| ---                     | ---       | ---      |
| str                     | cat       | bool     |
|-------------------------|-----------|----------|
| Greektown Market        | West Side | true     |
| West Loop Chef Showcase | West Loop | true     |
  • For other string operations, cast the column back to String first

Categorical and Enum

Diagram of a table with an event title column and an area column where the strings are encoded by integers.

Creating an Enum column

area_enum = pl.Enum([
    "Albany Park",
    "Andersonville",
    ...
    "West Loop",
    "West Side",
    "Wicker Park",
])

events_enum = events.with_columns(
    pl.col("area").cast(area_enum)
)

events_enum.select("event_title", "area").head()
shape: (5, 2)
| event_title                  | area           |
| ---                          | ---            |
| str                          | enum           |
|------------------------------|----------------|
| Folk Festival                | Andersonville  |
| Fireworks Night              | Downtown       |
| Greektown Market             | West Side      |
| Rail History Day             | Pullman        |
| Grant Park Concert           | Loop           |

Inspecting the Enum dtype

events_enum.schema["area"]
Enum(categories=['Albany Park', 'Andersonville', 'Chinatown', 'Downtown', ...])
  • Enum: for a fixed vocabulary whose categories are declared up front
  • Categorical: for a variable vocabulary whose categories are inferred from the data

Enum memory efficiency

events_enum.select("event_title","area").with_columns(
    pl.col("area").to_physical().alias("physical")
).head(3)
shape: (3, 3)
| event_title        | area          | physical |
| ---                | ---           | ---      |
| str                | enum          | u8       |
|--------------------|---------------|----------|
| Folk Festival      | Andersonville | 1        |
| Fireworks Night    | Downtown      | 3        |
| Greektown Market   | West Side     | 16       |

Let's practice!
