Parsing CSVs

Scaling and Optimizing Data Pipelines with Polars

Liam Brannigan

Data Scientist & Polars Contributor

Finance CSV extract

Chicago Finance Department
Generated: 2026-02-01
WARD;TYPE;STATUS;REQUEST_COST
11;Pothole in Street Complaint;Completed;125
42;Street Light Out;Open;88
27;Alley Light Out;Completed;61
  • 🛑 Header is not the first row
  • 🛑 Columns are separated with semicolons

Skipping rows

import polars as pl

vendor_requests = pl.read_csv(
    "ward_service_requests.csv",
    skip_rows=2,  # skip the two lines above the header row
)

Setting the separator

vendor_requests = pl.read_csv(
    "ward_service_requests.csv",
    skip_rows=2,
    separator=";",
)

Checking the parsed file

vendor_requests.head(3)
shape: (3, 4)
| WARD | TYPE                        | STATUS    | REQUEST_COST |
| ---  | ---                         | ---       | ---          |
| i64  | str                         | str       | i64          |
|------|-----------------------------|-----------|--------------|
| 11   | Pothole in Street Complaint | Completed | 125          |
| 42   | Street Light Out            | Open      | 88           |
| 27   | Alley Light Out             | Completed | 61           |

A schema inference problem

WARD;TYPE;STATUS;REQUEST_COST
11;Pothole in Street Complaint;Completed;125
42;Street Light Out;Open;88
27;Alley Light Out;Completed;61



  • First 100 rows → schema inferred
  • Ints only → column as Int64

Schema inference

WARD;TYPE;STATUS;REQUEST_COST
11;Pothole in Street Complaint;Completed;125
42;Street Light Out;Open;88
27;Alley Light Out;Completed;61
...
3;Rodent Baiting Service Request;Completed;61.5


  • First 100 rows → schema inferred
  • Ints only → column as Int64
  • Later 61.5 → cannot be parsed as Int64

Schema inference

vendor_requests = pl.read_csv(
    "ward_service_requests.csv",
    separator=";",
    skip_rows=2,
    infer_schema_length=200,
)

Schema inference

vendor_requests.schema
Schema({'WARD': Int64, 'TYPE': String, 'STATUS': String, 'REQUEST_COST': Float64})

Providing the schema

vendor_requests = pl.read_csv(
    "ward_service_requests.csv",
    separator=";",
    skip_rows=2,
    schema={
        "WARD": pl.Int64,
        "TYPE": pl.String,
        "STATUS": pl.String,
        "REQUEST_COST": pl.Float64,
    },
)

Overriding the inferred schema

vendor_requests = pl.read_csv(
    "ward_service_requests.csv",
    separator=";",
    skip_rows=2,
    schema_overrides={"REQUEST_COST": pl.Float64},
)

Checking the override

vendor_requests.schema
Schema({'WARD': Int64, 'TYPE': String, 'STATUS': String, 'REQUEST_COST': Float64})

A bad data problem

WARD;TYPE;STATUS;REQUEST_COST
11;Pothole in Street Complaint;Completed;125.0
unknown;Street Light Out;Open;88.5
27;Alley Light Out;Completed;61.0

Ignoring parse errors

pl.read_csv(
    "ward_service_requests.csv",
    separator=";",
    skip_rows=2,
    infer_schema_length=200,
    ignore_errors=True,
)


  • Bad values → null
  • Hides other errors

Marking null values

pl.read_csv(
    "ward_service_requests.csv",
    separator=";",
    skip_rows=2,
    infer_schema_length=200,
    null_values={"WARD": "unknown"},
)


  • Preserves intended dtype
  • Safer than ignore_errors

Let's practice!
