Improve import performance

Cleaning Data with PySpark

Mike Metzger

Data Engineering Consultant

Spark clusters

Spark clusters are made up of two types of processes:

  • Driver process, which coordinates the work across the cluster
  • Worker processes, which perform the actual tasks on the data
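As a minimal sketch (assuming a local or managed cluster that getOrCreate() can reach), building a SparkSession is what starts the driver process; the cluster manager supplies the workers:

    from pyspark.sql import SparkSession

    # Building the session starts (or attaches to) the driver process;
    # the worker processes run the tasks the driver schedules
    spark = SparkSession.builder.appName('cleaning_data').getOrCreate()

The later snippets assume this spark session is available.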

Import performance

Important parameters:

  • Number of objects (files, network locations, etc.)
    • More small objects are better than fewer large ones
    • Can import via wildcard (see the sketch after this list)
      airport_df = spark.read.csv('airports-*.txt.gz')
      
  • General size of objects
    • Spark performs better if objects are of similar size
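A minimal sketch of the wildcard import, assuming an active SparkSession named spark and a set of similarly sized files such as airports-000.txt.gz, airports-001.txt.gz, and so on:

    # Read every file matching the pattern into one DataFrame
    airport_df = spark.read.csv('airports-*.txt.gz')

    # Checking the partition count gives a rough idea of how evenly
    # the import work is spread across the workers
    print(airport_df.rdd.getNumPartitions())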

Schemas

A well-defined schema will drastically improve import performance

  • Avoids reading the data multiple times
  • Provides validation on import
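As a minimal sketch, the schema can be handed to the reader up front; the column names and types below are a hypothetical airport layout, and an active SparkSession named spark is assumed:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # Hypothetical layout: FAA code, airport name, altitude
    airport_schema = StructType([
        StructField('faa_code', StringType(), False),
        StructField('name', StringType(), True),
        StructField('altitude', IntegerType(), True)
    ])

    # Supplying the schema skips the extra pass Spark would otherwise need
    # to infer types, and the declared types act as validation on import
    airport_df = spark.read.csv('airports-*.txt.gz', schema=airport_schema)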

How to split objects

  • Use OS utilities / scripts (split, cut, awk)
    # Split largefile into 10,000-line pieces named chunk-00, chunk-01, ...
    split -l 10000 -d largefile chunk-
    
  • Use custom scripts
  • Write out to Parquet (see the note below)
    # Read the single large CSV once, write it back out as Parquet,
    # then work from the Parquet copy going forward
    df_csv = spark.read.csv('singlelargefile.csv')
    df_csv.write.parquet('data.parquet')
    df = spark.read.parquet('data.parquet')
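
Parquet works well for this because it is a columnar format that stores the schema with the data and is written as multiple part files, so later reads avoid schema inference and parallelize cleanly across the workers.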
    

Let's practice!
