Scaling and Optimizing Data Pipelines with Polars
Liam Brannigan
Data Scientist & Polars Contributor



scan_csv
filter, select, and group_by





import polars as pl

requests = pl.scan_csv("311_Service_Requests.csv", try_parse_dates=True)

# Materialize the full dataset
requests.collect()
# Materialize the full dataset, then keep the first five rows
requests.collect().head(5)
shape: (5, 39)
| TYPE | STATUS | DEPARTMENT | CREATED_DATE | ... |
| --- | --- | --- | --- | --- |
| str | str | str | str | ... |
|-------------------------------|-----------|----------------|---------------------|-----|
| Pothole in Street Complaint | Completed | Transportation | 2019-12-16T10:09:08 | ... |
| Tree Trim Request | Cancelled | Sanitation | 2019-09-18T01:05:08 | ... |
| Garbage Cart Maintenance | Completed | Sanitation | 2021-01-24T09:14:58 | ... |
| Pothole in Street Complaint | Completed | Transportation | 2019-03-21T10:41:01 | ... |
| Recycling Pick Up | Completed | Sanitation | 2021-02-16T08:28:59 | ... |
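# Add the five-row limit to the lazy query (nothing is read yet)
requests.head(5)
# Run the query: the limit is part of the plan, so Polars need not load the whole file
requests.head(5).collect()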
shape: (5, 39)
| TYPE | STATUS | DEPARTMENT | CREATED_DATE | ... |
| --- | --- | --- | --- | --- |
| str | str | str | str | ... |
|---------------------------------|-----------|----------------|---------------------|-----|
| Pothole in Street Complaint | Completed | Transportation | 2019-12-16T10:09:08 | ... |
| Tree Trim Request | Completed | Sanitation | 2019-09-18T01:05:08 | ... |
| Garbage Cart Maintenance | Completed | Sanitation | 2021-01-24T09:14:58 | ... |
| Pothole in Street Complaint | Completed | Transportation | 2019-03-21T10:41:01 | ... |
| Recycling Pick Up | Completed | Sanitation | 2021-02-16T08:28:59 | ... |
collect
completed_by_department
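# Start from the lazy query
completed_by_department = requests
# Add a filter (still lazy)
completed_by_department = requests.filter(
    pl.col("STATUS") == "Completed"
)
# Collecting here materializes every completed row
completed_by_department = requests.filter(
    pl.col("STATUS") == "Completed"
).collect()
# Collecting before group_by runs the aggregation in eager mode
completed_by_department = requests.filter(
    pl.col("STATUS") == "Completed"
).collect().group_by("DEPARTMENT").len()
# Collecting last keeps the filter and the aggregation in one lazy query
completed_by_department = requests.filter(
    pl.col("STATUS") == "Completed"
).group_by("DEPARTMENT").len().collect()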
shape: (10, 2)
| DEPARTMENT | len |
| --- | --- |
| str | u32 |
|-------------------------------|---------|
| 311 City Services | 4859161 |
| Sanitation | 3406631 |
| Aviation | 2337842 |
| CDOT - Department of Transport | 1468240 |
completed_by_department = requests.filter(
    pl.col("STATUS") == "Completed"
).group_by("DEPARTMENT").len()
completed_by_month = requests.filter(
    pl.col("STATUS") == "Completed"
).group_by("MONTH").len()
completed_by_department.collect()
completed_by_month.collect()
results = pl.collect_all([
    completed_by_department,
    completed_by_month,
])
results[0] # completed_by_department
shape: (10, 2)
| DEPARTMENT | len |
| --- | --- |
| str | u32 |
|----------------------|---------|
| 311 City Services | 4859161 |
| Sanitation | 3406631 |
| Aviation | 2337842 |
| Transport | 1468240 |
results[1] # completed_by_month
shape: (12, 2)
| MONTH | len |
| --- | --- |
| i64 | u32 |
|-------|------|
| 1 | 2506 |
| 2 | 4566 |
| 3 | 2739 |
| 4 | 2922 |