Improve import performance

Cleaning Data with PySpark

Mike Metzger

Data Engineering Consultant

Spark clusters

Spark clusters are made up of two types of processes:

  • Driver process, which coordinates the work across the cluster
  • Worker processes, which perform the actual tasks on the data
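As a minimal sketch (assuming a local or managed cluster that getOrCreate() can reach), building a SparkSession is what starts the driver process; the cluster manager supplies the workers:

    from pyspark.sql import SparkSession

    # Building the session starts (or attaches to) the driver process;
    # the worker processes run the tasks the driver schedules
    spark = SparkSession.builder.appName('cleaning_data').getOrCreate()

The later snippets assume this spark session is available.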

Import performance

Important parameters:

  • Number of objects (files, network locations, etc.)
    • More small objects are better than fewer large ones
    • Can import via wildcard (see the sketch after this list)
      airport_df = spark.read.csv('airports-*.txt.gz')
      
  • General size of objects
    • Spark performs better if objects are of similar size
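A minimal sketch of the wildcard import, assuming an active SparkSession named spark and a set of similarly sized files such as airports-000.txt.gz, airports-001.txt.gz, and so on:

    # Read every file matching the pattern into one DataFrame
    airport_df = spark.read.csv('airports-*.txt.gz')

    # Checking the partition count gives a rough idea of how evenly
    # the import work is spread across the workers
    print(airport_df.rdd.getNumPartitions())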

Schemas

A well-defined schema will drastically improve import performance

  • Avoids reading the data multiple times
  • Provides validation on import
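As a minimal sketch, the schema can be handed to the reader up front; the column names and types below are a hypothetical airport layout, and an active SparkSession named spark is assumed:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # Hypothetical layout: FAA code, airport name, altitude
    airport_schema = StructType([
        StructField('faa_code', StringType(), False),
        StructField('name', StringType(), True),
        StructField('altitude', IntegerType(), True)
    ])

    # Supplying the schema skips the extra pass Spark would otherwise need
    # to infer types, and the declared types act as validation on import
    airport_df = spark.read.csv('airports-*.txt.gz', schema=airport_schema)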

How to split objects

  • Use OS utilities / scripts (split, cut, awk)
    # Split largefile into 10,000-line pieces named chunk-00, chunk-01, ...
    split -l 10000 -d largefile chunk-
    
  • Use custom scripts
  • Write out to Parquet (see the note below)
    # Read the single large CSV once, write it back out as Parquet,
    # then work from the Parquet copy going forward
    df_csv = spark.read.csv('singlelargefile.csv')
    df_csv.write.parquet('data.parquet')
    df = spark.read.parquet('data.parquet')
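
Parquet works well for this because it is a columnar format that stores the schema with the data and is written as multiple part files, so later reads avoid schema inference and parallelize cleanly across the workers.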
    

Let's practice!
