Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
Data Cleaning: Preparing raw data for use in data processing pipelines.
Possible tasks in data cleaning:
Problems with typical data systems:
Advantages of Spark:
Raw data:

| name | age (years) | city |
|---|---|---|
| Smith, John | 37 | Dallas |
| Wilson, A. | 59 | Chicago |
| null | 215 | |
Cleaned data:

| last name | first name | age (months) | state |
|---|---|---|---|
| Smith | John | 444 | TX |
| Wilson | A. | 708 | IL |
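The row-level steps that take the raw table to the cleaned one can be sketched in plain Python. The helper name `clean_row` and the `CITY_TO_STATE` lookup are illustrative assumptions, not part of the slides:

```python
# Illustrative city-to-state lookup (an assumption, not from the slides)
CITY_TO_STATE = {'Dallas': 'TX', 'Chicago': 'IL'}

MONTHS_PER_YEAR = 12

def clean_row(name, age_years, city):
    """Split a 'Last, First' name, convert age to months, map city to state.

    Returns None for rows that cannot be cleaned (missing name or city).
    """
    if name is None or city is None:
        return None
    last, first = [part.strip() for part in name.split(',', 1)]
    return (last, first, age_years * MONTHS_PER_YEAR, CITY_TO_STATE.get(city))

raw = [('Smith, John', 37, 'Dallas'),
       ('Wilson, A.', 59, 'Chicago'),
       (None, 215, None)]

# Drop rows that fail cleaning, keep the transformed ones
cleaned = [row for row in (clean_row(*r) for r in raw) if row is not None]
# cleaned == [('Smith', 'John', 444, 'TX'), ('Wilson', 'A.', 708, 'IL')]
```

Note that the malformed third row (null name, implausible age) is filtered out entirely, which is one of the cleaning tasks the slides describe.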
Import the schema types
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

peopleSchema = StructType([
    # Define the name field
    StructField('name', StringType(), True),
    # Add the age field
    StructField('age', IntegerType(), True),
    # Add the city field
    StructField('city', StringType(), True)
])
Read the CSV file containing the data, applying the defined schema
people_df = spark.read.format('csv').load(path='rawdata.csv', schema=peopleSchema)
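What the schema contract enforces can be sketched outside Spark as simple per-row checks. This is a pure-Python illustration of the type and nullability rules, not Spark's actual CSV parsing logic; the `FIELDS` triples and the `conforms` helper are assumptions for the sketch:

```python
# (name, type, nullable) triples mirroring peopleSchema above
FIELDS = [('name', str, True), ('age', int, True), ('city', str, True)]

def conforms(row):
    """Return True if a dict-shaped row matches the schema's types and nullability."""
    for field_name, field_type, nullable in FIELDS:
        value = row.get(field_name)
        if value is None:
            # A missing value is acceptable only when the field is nullable
            if not nullable:
                return False
        elif not isinstance(value, field_type):
            # A present value must match the declared type
            return False
    return True

conforms({'name': 'Smith, John', 'age': 37, 'city': 'Dallas'})            # True
conforms({'name': 'Smith, John', 'age': 'thirty-seven', 'city': 'Dallas'})  # False
```

All three fields are declared nullable (`True`), so a null value passes the check; a wrongly typed value, such as a string in the age field, does not.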