Effective query execution

Scaling and Optimizing Data Pipelines with Polars

Liam Brannigan

Data Scientist & Polars Contributor

Meet your instructor

$$

$$

  • Liam Brannigan, Lead Data Scientist
  • ML and Data Engineering Specialist
  • Polars contributor

Picture of Liam Brannigan - the instructor of this course

Scaling and Optimizing Data Pipelines with Polars

From laptop to cloud

Polars pipeline scaling from laptop to cloud

Scaling and Optimizing Data Pipelines with Polars

Is this course for you?

Polars course pages

  • Creating a lazy query with scan_csv
  • Writing expressions with filter, select, and group_by
Scaling and Optimizing Data Pipelines with Polars

Chapter 1 - optimization

Query optimization

Scaling and Optimizing Data Pipelines with Polars

Chapter 1 - optimization

Query optimization

Scaling and Optimizing Data Pipelines with Polars

Chapter 2 - effective I/O

Image of Polars ingesting multiple file types

Scaling and Optimizing Data Pipelines with Polars

Chapter 3 - richer dtypes

Diagram of nested data transformed into a table.

Scaling and Optimizing Data Pipelines with Polars

Chapter 4 - scaling pipelines

Image of large table being streamed.

Scaling and Optimizing Data Pipelines with Polars

Chicago requests dataset

The Mayor of Chicago's data analytics team

Scaling and Optimizing Data Pipelines with Polars

Inspecting the dataset

requests = pl.scan_csv("311_Service_Requests.csv",try_parse_dates=True)

A massive dataset, recording every service request from Chicago's citizens

Scaling and Optimizing Data Pipelines with Polars

Inspecting the dataset

requests.collect()
Scaling and Optimizing Data Pipelines with Polars

Inspecting the dataset

requests.collect().head(5)
shape: (5, 39)
| TYPE                          | STATUS    | DEPARTMENT     | CREATED_DATE        | ... |
| ---                           | ---       | ---            | ---                 | --- |
| str                           | str       | str            | str                 | ... |
|-------------------------------|-----------|----------------|---------------------|-----|
| Pothole in Street Complaint   | Completed | Transportation | 2019-12-16T10:09:08 | ... |
| Tree Trim Request             | Cancelled | Sanitation     | 2019-09-18T01:05:08 | ... |
| Garbage Cart Maintenance      | Completed | Sanitation     | 2021-01-24T09:14:58 | ... |
| Pothole in Street Complaint   | Completed | Transportation | 2019-03-21T10:41:01 | ... |
| Recycling Pick Up             | Completed | Sanitation     | 2021-02-16T08:28:59 | ... |
Scaling and Optimizing Data Pipelines with Polars

Limiting rows processed

requests.head(5)
Scaling and Optimizing Data Pipelines with Polars

Limiting rows processed

requests.head(5).collect()
shape: (5, 39)
| TYPE                            | STATUS    | DEPARTMENT     | CREATED_DATE        | ... |
| ---                             | ---       | ---            | ---                 | --- |
| str                             | str       | str            | str                 | ... |
|---------------------------------|-----------|----------------|---------------------|-----|
| Pothole in Street Complaint     | Completed | Transportation | 2019-12-16T10:09:08 | ... |
| Tree Trim Request | Completed | Sanitation     | 2019-09-18T01:05:08 | ... |
| Garbage Cart Maintenance        | Completed | Sanitation     | 2021-01-24T09:14:58 | ... |
| Pothole in Street Complaint     | Completed | Transportation | 2019-03-21T10:41:01 | ... |
| Recycling Pick Up               | Completed | Sanitation     | 2021-02-16T08:28:59 | ... |
  • Delay calling collect
Scaling and Optimizing Data Pipelines with Polars

Department summary query

completed_by_department


Scaling and Optimizing Data Pipelines with Polars

Department summary query

completed_by_department = requests


Scaling and Optimizing Data Pipelines with Polars

Department summary query

completed_by_department = requests.filter(
    pl.col("STATUS") == "Completed"
)
Scaling and Optimizing Data Pipelines with Polars

Department summary query

completed_by_department = requests.filter(
    pl.col("STATUS") == "Completed"
).collect()
Scaling and Optimizing Data Pipelines with Polars

Department summary query

completed_by_department = requests.filter(
    pl.col("STATUS") == "Completed"
).collect().group_by("DEPARTMENT").len()
Scaling and Optimizing Data Pipelines with Polars

Department summary query

completed_by_department = requests.filter(
    pl.col("STATUS") == "Completed"
).group_by("DEPARTMENT").len().collect()
shape: (10, 2)
| DEPARTMENT                     | len     |
| ---                            | ---     |
| str                            | u32     |
|-------------------------------|---------|
| 311 City Services              | 4859161 |
| Sanitation                     | 3406631 |
| Aviation                       | 2337842 |
| CDOT - Department of Transport | 1468240 |
Scaling and Optimizing Data Pipelines with Polars

Diverging query branches

completed_by_department = requests.filter(
    pl.col("STATUS") == "Completed"
).group_by("DEPARTMENT").len()
completed_by_month = requests.filter(
    pl.col("STATUS") == "Completed"
).group_by("MONTH").len()
Scaling and Optimizing Data Pipelines with Polars

How the team runs this today

completed_by_department.collect()
completed_by_month.collect()
Scaling and Optimizing Data Pipelines with Polars

Executing diverging queries

results = pl.collect_all(


)
Scaling and Optimizing Data Pipelines with Polars

Executing diverging queries

results = pl.collect_all([
    completed_by_department,
    completed_by_month,
])
Scaling and Optimizing Data Pipelines with Polars

Executing diverging queries

results[0]  # completed_by_department
shape: (10, 2)
| DEPARTMENT           | len     |
| ---                  | ---     |
| str                  | u32     |
|----------------------|---------|
| 311 City Services    | 4859161 |
| Sanitation           | 3406631 |
| Aviation             | 2337842 |
| Transport            | 1468240 |
results[1]  # completed_by_month
shape: (12, 2)
| MONTH | len  |
| ---   | ---  |
| i64   | u32  |
|-------|------|
| 1     | 2506 |
| 2     | 4566 |
| 3     | 2739 |
| 4     | 2922 |
Scaling and Optimizing Data Pipelines with Polars

Let's practice!

Scaling and Optimizing Data Pipelines with Polars

Preparing Video For Download...