Scaling and Optimizing Data Pipelines with Polars
Liam Brannigan
Data Scientist & Polars Contributor


import polars as pl

events = pl.read_parquet("chicago_events.parquet")
events["area"].value_counts(sort=True).head(5)
shape: (5, 2)
| area | count |
| --- | --- |
| str | u32 |
|------------- |-------|
| Loop | 35982 |
| West Loop | 27432 |
| Andersonville | 24660 |
| Downtown | 24320 |
| Pullman | 21985 |
events_cat = events.with_columns(
    pl.col("area").cast(pl.Categorical)
)
events_cat.select("event_title", "area").head()
shape: (5, 2)
| event_title | area |
| --- | --- |
| str | cat |
|--------------------|---------------|
| Folk Festival | Andersonville |
| Fireworks Night | Downtown |
| Greektown Market | West Side |
| Rail History Day | Pullman |
| Grant Park Concert | Loop |
events_cat.select("event_title", "area").with_columns(
    pl.col("area").to_physical().alias("physical")
).head(3)
shape: (3, 3)
| event_title | area | physical |
| --- | --- | --- |
| str | cat | u32 |
|--------------------|---------------|----------|
| Folk Festival | Andersonville | 0 |
| Fireworks Night | Downtown | 1 |
| Greektown Market | West Side | 2 |
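The physical column above is consistent with dictionary encoding: each distinct string is stored once and every row holds a small integer code. The following is a minimal pure-Python sketch of that idea, not Polars' internal implementation. The helper name `dictionary_encode` is hypothetical, the repeated "Downtown" row is added for illustration, and the first-appearance ordering of codes is an assumption that matches the output shown but may not hold in every Polars version.

```python
def dictionary_encode(values):
    """Return (codes, categories), assigning codes in order of first appearance."""
    categories = []  # distinct strings, stored once each
    index = {}       # string -> integer code
    codes = []       # one integer per input row
    for value in values:
        if value not in index:
            index[value] = len(categories)
            categories.append(value)
        codes.append(index[value])
    return codes, categories


# The five areas from the head() output above, plus one illustrative repeat.
areas = ["Andersonville", "Downtown", "West Side", "Pullman", "Loop", "Downtown"]
codes, categories = dictionary_encode(areas)
print(codes)
print(categories)
```

The repeated "Downtown" reuses code 1 instead of storing the string a second time, which is where the memory saving on low-cardinality columns comes from.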
events_cat.select(
    "event_title",
    "area",
    pl.col("area").cat.starts_with("West").alias("westside"),
).filter(pl.col("westside")).head(2)
shape: (2, 3)
| event_title | area | westside |
| --- | --- | --- |
| str | cat | bool |
|-------------------------|-----------|----------|
| Greektown Market | West Side | true |
| West Loop Chef Showcase | West Loop | true |

area_enum = pl.Enum([
    "Albany Park",
    "Andersonville",
    ...
    "West Loop",
    "West Side",
    "Wicker Park",
])
events_enum = events.with_columns(
    pl.col("area").cast(area_enum)
)
events_enum.select("event_title", "area").head()
shape: (5, 2)
| event_title | area |
| --- | --- |
| str | enum |
|------------------------------|----------------|
| Greektown Market | West Side |
| Folk Festival | Andersonville |
| West Loop Chef Showcase | West Loop |
| Rail History Day | Pullman |
| Fireworks Night | Downtown |
events_enum.schema["area"]
Enum(categories=['Albany Park', 'Andersonville', 'Chinatown', 'Downtown', ...])
events_enum.select("event_title", "area").with_columns(
    pl.col("area").to_physical().alias("physical")
).head(3)
shape: (3, 3)
| event_title | area | physical |
| --- | --- | --- |
| str | enum | u8 |
|--------------------|---------------|----------|
| Folk Festival | Andersonville | 1 |
| Fireworks Night | Downtown | 3 |
| Greektown Market | West Side | 16 |