Scaling and Optimizing Data Pipelines with Polars
Liam Brannigan
Data Scientist & Polars Contributor

events.select("event_title", "area", "venue_context").head()
shape: (5, 3)
| event_title | area | venue_context |
| --- | --- | --- |
| str | str | struct[2] |
|-----------------------------|---------------|--------------------------------|
| Greektown Market | West Side | {"Market","OpenAir"} |
| Folk Festival | Andersonville | {"Festival","OpenAir"} |
| West Loop Chef Showcase | West Loop | {"Food Hall","Enclosed"} |
| Rail History Day | Pullman | {"Historic Site","Mixed"} |
| Fireworks Night | Downtown | {"Pier","Mixed"} |
events.schema
event_date: Date
event_title: String
tags: List(String)
area: String
venue_context: Struct({'venue_type': String, 'venue_space': String})
visitors: Int64
profile: Int64
price: Float64
events["venue_context"].struct.fields
['venue_type', 'venue_space']
events["venue_context"][0]
{'venue_type': 'Market', 'venue_space': 'OpenAir'}
events["venue_context"].struct["venue_type"]
shape: (105762,)
Series: 'venue_type' [str]
[
"Market"
"Festival"
"Historic Site"
"Pier"
"Gallery"
...
]
events.select(pl.col("venue_context").struct.rename_fields(["type", "space"])).head()
shape: (5, 1)
| venue_context |
| --- |
| struct[2] |
|--------------------------------|
| {"Market","OpenAir"} |
| {"Festival","OpenAir"} |
| {"Food Hall","Enclosed"} |
| {"Historic Site","Mixed"} |
| {"Pier","Mixed"} |
events.with_columns(
    pl.col("venue_context").struct.rename_fields(["type", "space"])
).unnest("venue_context").filter(pl.col("type") == "Gallery").head()
shape: (5, 4)
| event_title | area | type | space |
| --- | --- | --- | --- |
| str | str | str | str |
|--------------------------------|------------------|---------|----------|
| Wicker Park Design Weekend | Wicker Park | Gallery | Enclosed |
| River North First Friday | North Side | Gallery | Enclosed |
| Pilsen Studio Open House | Pilsen | Gallery | Mixed |
| West Loop Photo Biennial | West Loop | Gallery | Enclosed |
| South Side Emerging Artists | South Side | Gallery | Enclosed |
def title_to_word_set(title):
    return set(title.lower().split())

events.with_columns(
    pl.col("event_title").map_elements(
        title_to_word_set,
        return_dtype=pl.Object,
    ).alias("title_word_set")
)
shape: (8, 2)
| event_title | title_word_set |
| --- | --- |
| str | object |
|---------------------------|----------------------------------|
| Greektown Market | {'market', 'greektown'} |
| Folk Festival | {'festival', 'folk'} |
| Rail History Day | {'day', 'history', 'rail'} |
| Fireworks Night | {'night', 'fireworks'} |
| Grant Park Picnic | {'park', 'grant', 'picnic'} |
| Jazz on the River | {'jazz', 'on', 'the', 'river'} |
| Chef Showcase | {'chef', 'showcase'} |
| Story Marathon | {'story', 'marathon'} |