Working with nested columns

Scaling and Optimizing Data Pipelines with Polars

Liam Brannigan

Data Scientist & Polars Contributor

Nested column data

Diagram of a table with a nested data column.


Venue context data

events.select("event_title", "area", "venue_context").head()
shape: (5, 3)
| event_title                 | area          | venue_context                  |
| ---                         | ---           | ---                            |
| str                         | str           | struct[2]                      |
|-----------------------------|---------------|--------------------------------|
| Greektown Market            | West Side     | {"Market","OpenAir"}           |
| Folk Festival               | Andersonville | {"Festival","OpenAir"}         |
| West Loop Chef Showcase     | West Loop     | {"Food Hall","Enclosed"}       |
| Rail History Day            | Pullman       | {"Historic Site","Mixed"}      |
| Fireworks Night             | Downtown      | {"Pier","Mixed"}               |

Inspecting the nested schema

events.schema
event_date: Date
event_title: String
tags: List(String)
area: String
venue_context: Struct({'venue_type': String, 'venue_space': String})
visitors: Int64
profile: Int64
price: Float64

Venue context fields

events["venue_context"].struct.fields
['venue_type', 'venue_space']

Venue context values

events["venue_context"][0]
{'venue_type': 'Market', 'venue_space': 'OpenAir'}

Venue context values

events["venue_context"].struct["venue_type"]
shape: (105762,)
Series: 'venue_type' [str]
[
    "Market"
    "Festival"
    "Historic Site"
    "Pier"
    "Gallery"
    ...
]

Renaming venue context fields

events.select(pl.col("venue_context").struct.rename_fields(["type", "space"])).head()
shape: (5, 1)
| venue_context                  |
| ---                            |
| struct[2]                      |
|--------------------------------|
| {"Market","OpenAir"}           |
| {"Festival","OpenAir"}         |
| {"Food Hall","Enclosed"}       |
| {"Historic Site","Mixed"}      |
| {"Pier","Mixed"}               |

Unnesting the venue context

events.with_columns(
  pl.col("venue_context").struct.rename_fields(["type", "space"])
)

Unnesting the venue context

events.with_columns(
  pl.col("venue_context").struct.rename_fields(["type", "space"])
).unnest("venue_context")

Unnesting the venue context

events.with_columns(
  pl.col("venue_context").struct.rename_fields(["type", "space"])
).unnest("venue_context").filter(pl.col("type") == "Gallery")

Unnesting the venue context

events.select("event_title", "area", "venue_context").with_columns(
  pl.col("venue_context").struct.rename_fields(["type", "space"])
).unnest("venue_context").filter(pl.col("type") == "Gallery").head()
shape: (5, 4)
| event_title                    | area             | type    | space    |
| ---                            | ---              | ---     | ---      |
| str                            | str              | str     | str      |
|--------------------------------|------------------|---------|----------|
| Wicker Park Design Weekend     | Wicker Park      | Gallery | Enclosed |
| River North First Friday       | North Side       | Gallery | Enclosed |
| Pilsen Studio Open House       | Pilsen           | Gallery | Mixed    |
| West Loop Photo Biennial       | West Loop        | Gallery | Enclosed |
| South Side Emerging Artists    | South Side       | Gallery | Enclosed |

Creating an object dtype

def title_to_word_set(title):
    return set(title.lower().split())

Creating an object dtype

def title_to_word_set(title):
    return set(title.lower().split())

events.with_columns(
    pl.col("event_title").map_elements(
        title_to_word_set,
        return_dtype=pl.Object,
    ).alias("title_word_set")
)

Creating an object dtype

shape: (8, 2)
| event_title               | title_word_set                   |
| ---                       | ---                              |
| str                       | object                           |
|---------------------------|----------------------------------|
| Greektown Market          | {'market', 'greektown'}          |
| Folk Festival             | {'festival', 'folk'}             |
| Rail History Day          | {'day', 'history', 'rail'}       |
| Fireworks Night           | {'night', 'fireworks'}           |
| Grant Park Picnic         | {'park', 'grant', 'picnic'}      |
| Jazz on the River         | {'jazz', 'on', 'the', 'river'}   |
| Chef Showcase             | {'chef', 'showcase'}             |
| Story Marathon            | {'story', 'marathon'}            |

Let's practice!
