Working with Parquet files

Scaling and Optimizing Data Pipelines with Polars

Liam Brannigan

Data Scientist & Polars Contributor

File storage formats

Diagram of a DataFrame

Scaling and Optimizing Data Pipelines with Polars

CSV Format

Diagram showing CSV as row-by-row storage, with each record stored across a full row.

Scaling and Optimizing Data Pipelines with Polars

Parquet Format

Diagram showing Parquet as column-oriented storage, with values from the same column grouped together.

Scaling and Optimizing Data Pipelines with Polars

Converting the archive

(
    pl.read_csv("311_Service_Requests.csv", try_parse_dates=True)

)
Scaling and Optimizing Data Pipelines with Polars

Converting the archive

(
    pl.read_csv("311_Service_Requests.csv", try_parse_dates=True)
    .write_parquet("311_Service_Requests.parquet")
)
Scaling and Optimizing Data Pipelines with Polars

Inspecting a Parquet file

pl.read_parquet_schema(
    "311_Service_Requests.parquet"
)
Schema(
    "CREATED_DATE": Datetime,
    "TYPE": String,
    "STATUS": String,
    "DEPARTMENT": String,
    "WARD": Int64,
    ...
)
Scaling and Optimizing Data Pipelines with Polars

CSV vs Parquet

$$

CSV
  • 4.8 GB
  • 14 seconds

$$

Parquet
  • 0.6 GB
  • 1.4 seconds

csv vs parquet illustration

Scaling and Optimizing Data Pipelines with Polars

Parquet is not for appending rows

new_request = {"TYPE": "Pothole", "STATUS": "Open", ...}
  • CSV for appends
  • Parquet for reads
Scaling and Optimizing Data Pipelines with Polars

Parquet row groups

Diagram showing Parquet file divided into row groups.

Scaling and Optimizing Data Pipelines with Polars

Filtering with row groups

(
    pl.scan_parquet("311_Service_Requests.parquet")
    .filter(pl.col("CREATED_DATE") < pl.datetime(2020, 1, 1))
)
Scaling and Optimizing Data Pipelines with Polars

Parquet row groups

Diagram of a Parquet file with row groups.

Scaling and Optimizing Data Pipelines with Polars

Parquet row groups

Diagram of a Parquet file with row groups and statistics.

Scaling and Optimizing Data Pipelines with Polars

Scanning with parallel strategies

(
    pl.scan_parquet(
        "311_Service_Requests.parquet",

    )
)
Scaling and Optimizing Data Pipelines with Polars

Scanning with parallel strategies

(
    pl.scan_parquet(
        "311_Service_Requests.parquet",
        parallel="row_groups",
    )
)
  • columns
  • prefiltered
  • none
Scaling and Optimizing Data Pipelines with Polars

Controlling Parquet writes

department_counts.write_parquet(
    "department_counts.parquet",
    compression_level=3,

)
  • Smaller file sizes
  • Longer read and write times
Scaling and Optimizing Data Pipelines with Polars

Controlling Parquet writes

department_counts.write_parquet(
    "department_counts.parquet",
    compression_level=3,
    row_group_size=25_000,
)
  • Smaller file sizes
  • Longer read and write times
  • More statistics to read
Scaling and Optimizing Data Pipelines with Polars

Let's practice!

Scaling and Optimizing Data Pipelines with Polars

Preparing Video For Download...