Handling array data

Scaling and Optimizing Data Pipelines with Polars

Liam Brannigan

Data Scientist & Polars Contributor

Array data

Diagram of a table with an event title column and an array column of tags

Events dataset

import polars as pl
events = pl.read_parquet("chicago_events.parquet")

events.select(
    "event_date", "event_title", "tags", "area"
).head()
shape: (5, 4)
| event_date | event_title        | tags                                | area          |
| ---        | ---                | ---                                 | ---           |
| date       | str                | list[str]                           | str           |
|------------|--------------------|-------------------------------------|---------------|
| 2025-06-14 | Chef Showcase      | ["food", "chef_demo"]               | West Loop     |
| 2025-06-29 | Folk Festival      | ["music", "neighborhood", "crafts"] | Andersonville |
| 2025-05-25 | Fireworks Night    | ["nightlife", "family"]             | Downtown      |
| 2025-09-20 | Rail History Day   | ["history", "family", "education"]  | Pullman       |
| 2025-07-04 | Grant Park Concert | ["music", "holiday", "family"]      | Loop          |

Event tags

events["tags"][0]
shape: (2,)
Series: '' [str]
[
    "food"
    "chef_demo"
]

Getting the primary tag

events.select(
    "event_title",
    "tags",
    pl.col("tags").list.get(0),
)

Getting the primary tag

events.select(
    "event_title",
    "tags",
    pl.col("tags").list.get(0).alias("primary_tag"),
).head(3)
shape: (3, 3)
| event_title        | tags                                | primary_tag |
| ---                | ---                                 | ---         |
| str                | list[str]                           | str         |
|--------------------|-------------------------------------|-------------|
| Chef Showcase      | ["food", "chef_demo"]               | food        |
| Folk Festival      | ["music", "neighborhood", "crafts"] | music       |
| Fireworks Night    | ["nightlife", "family"]             | nightlife   |

Parsing event features

events.select(
    "event_title","tags",
    pl.col("tags").list.len().alias("tag_count"),
    pl.col("tags").list.contains("family").alias("has_family"),
)

Family-friendly events

events.select(
    "event_title","tags",
    pl.col("tags").list.len().alias("tag_count"),
    pl.col("tags").list.contains("family").alias("has_family"),
).filter(pl.col("has_family")).head(3)
shape: (3, 4)
| event_title        | tags                               | tag_count | has_family |
| ---                | ---                                | ---       | ---        |
| str                | list[str]                          | u32       | bool       |
|--------------------|------------------------------------|-----------|------------|
| Fireworks Night    | ["nightlife", "family"]            | 2         | true       |
| Rail History Day   | ["history", "family", "education"] | 3         | true       |
| Grant Park Concert | ["music", "holiday", "family"]     | 3         | true       |

Polars list expressions

  • list.first(): index
  • list.unique(): deduplicate
  • list.mean(): numeric aggregation
  • Many more in the docs 🔥
1 https://docs.pola.rs/api/python/stable/reference/expressions/list.html

Exploding the tags column

events.select("event_title","tags").explode("tags")
shape: (1346328, 2)
| event_title        | tags         |
| ---                | ---          |
| str                | str          |
|--------------------|--------------|
| Chef Showcase      | food         |
| Chef Showcase      | chef_demo    |
| Folk Festival      | music        |
| Folk Festival      | neighborhood |
| Folk Festival      | crafts       |
| Fireworks Night    | nightlife    |
| ...                | ...          |

Counting tag popularity

events.select("event_title","tags").explode("tags")["tags"].value_counts(sort=True).head()
shape: (5, 2)
| tags      | count   |
| ---       | ---     |
| str       | u32     |
|-----------|---------|
| family    | 59864   |
| nightlife | 45302   |
| music     | 35612   |
| food      | 23948   |
| culture   | 21566   |

Let's practice!
