Understanding Parquet

Cleaning Data with PySpark

Mike Metzger

Data Engineering Consultant

Difficulties with CSV files

  • No defined schema (a sketch of defining one by hand follows this list)
  • Nested data requires special handling
  • Limited encoding options
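Because a CSV file carries no schema, the structure must be supplied (or inferred at extra cost) on every read. A minimal sketch, assuming a hypothetical people.csv with name and age columns:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName('csv_schema_example').getOrCreate()

# Define the schema by hand; otherwise every column is read as a string
# unless inferSchema is enabled, which costs an extra pass over the data.
manual_schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True),
])

df = (spark.read.format('csv')
      .option('header', 'true')
      .schema(manual_schema)
      .load('people.csv'))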

Spark and CSV files

  • Slow to parse
  • Files cannot be filtered (no "predicate pushdown")
  • Any intermediate use requires redefining the schema (illustrated in the sketch after this list)
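Writing an intermediate result back to CSV drops the schema, so the next stage has to redefine or re-infer it. A sketch continuing the example above (hypothetical path, reusing the df and manual_schema defined earlier):

# Save an intermediate result as CSV; the schema is not stored with it.
df.write.format('csv').option('header', 'true').save('intermediate_csv')

# Re-reading requires supplying the schema (or another inferSchema pass).
df2 = (spark.read.format('csv')
       .option('header', 'true')
       .schema(manual_schema)
       .load('intermediate_csv'))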

The Parquet Format

  • A columnar data format
  • Supported in Spark and other data processing frameworks
  • Supports predicate pushdown
  • Automatically stores schema information (both points are sketched after this list)
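A short sketch of both behaviors, assuming a hypothetical people.parquet file with an age column:

# The schema travels with the Parquet file itself.
df = spark.read.parquet('people.parquet')
df.printSchema()

# Simple filters are eligible for predicate pushdown; they typically
# appear as PushedFilters in the physical plan shown by explain().
adults = df.filter(df.age >= 18)
adults.explain()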

Working with Parquet

Reading Parquet files

# Two equivalent ways to load a Parquet file into a DataFrame
df = spark.read.format('parquet').load('filename.parquet')
df = spark.read.parquet('filename.parquet')

Writing Parquet files

# Two equivalent ways to save a DataFrame as Parquet
df.write.format('parquet').save('filename.parquet')
df.write.parquet('filename.parquet')
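By default Spark raises an error if the destination already exists; a save mode can be supplied to change that. A small sketch using the same hypothetical filename:

# Overwrite an existing destination instead of failing
# (the default save mode is 'error' / 'errorifexists').
df.write.mode('overwrite').parquet('filename.parquet')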

Parquet and SQL

Parquet files work well as a backing store for Spark SQL operations

# Load the Parquet file and expose it as a temporary SQL view
flight_df = spark.read.parquet('flights.parquet')
flight_df.createOrReplaceTempView('flights')

# Query the view with ordinary SQL; the result is a new DataFrame
short_flights_df = spark.sql('SELECT * FROM flights WHERE flightduration < 100')
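The query result is an ordinary DataFrame, so it can be persisted back to Parquet for later steps (hypothetical output path):

short_flights_df.write.mode('overwrite').parquet('short_flights.parquet')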

Let's Practice!

