Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
Spark clusters are made up of two types of processes: a driver process and one or more worker processes
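As a rough illustration, the session below runs the driver in the current process and uses local cores as workers; the master URL and application name are placeholders, not part of the original slides:
from pyspark.sql import SparkSession

# The driver runs in this process; 'local[*]' uses all local cores as worker threads
spark = SparkSession.builder \
    .master('local[*]') \
    .appName('cleaning_data') \
    .getOrCreate()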
Important parameters when importing files: the number of objects (many smaller files import faster than one very large file) and the general size of the objects (Spark performs better when files are of similar size)
# A wildcard imports every matching file into a single DataFrame
airport_df = spark.read.csv('airports-*.txt.gz')
A well-defined schema will drastically improve import performance
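As a sketch, the schema and column names below are assumptions for the airport files; passing an explicit schema lets Spark skip the costly inference pass over the data:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical schema for the airport files (column names are assumptions)
airport_schema = StructType([
    StructField('AIRPORTNAME', StringType(), False),
    StructField('CITY', StringType(), True),
    StructField('ELEVATION', IntegerType(), True)
])

# With a schema supplied, Spark does not need to scan the data to infer types
airport_df = spark.read.csv('airports-*.txt.gz', schema=airport_schema)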
# OS utilities can split a large file into similarly sized chunks of 10,000 lines each
split -l 10000 -d largefile chunk-
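The resulting chunks can then be read back in with a wildcard, just as with the gzipped airport files; the 'chunk-' prefix matches the split command above:
df_chunks = spark.read.csv('chunk-*')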
# The slower CSV-to-Parquet conversion only needs to happen once
df_csv = spark.read.csv('singlelargefile.csv')
df_csv.write.parquet('data.parquet')
# Later reads use the faster, schema-aware Parquet format
df = spark.read.parquet('data.parquet')
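Because Parquet stores the schema alongside the data, the reloaded DataFrame can be registered and queried with SQL right away; the view name and column below are illustrative assumptions:
# Register the DataFrame as a temporary view and query it with Spark SQL
df.createOrReplaceTempView('flights')
short_flights_df = spark.sql('SELECT * FROM flights WHERE flightduration < 500')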