Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
.withColumn(), .filter(), .drop()
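from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.types import StructType, StructField, StringType

# Define the expected schema; the third argument is the nullable flag
schema = StructType([
    StructField('name', StringType(), False),
    StructField('age', StringType(), False)
])
# Set the schema on the reader before load(); load() returns the DataFrame
df = spark.read.format('csv').schema(schema).load('datafile')
# Add a unique (not necessarily consecutive) ID column
df = df.withColumn('id', monotonically_increasing_id())
...
# Write the result in both Parquet and JSON formats
df.write.parquet('outdata.parquet')
df.write.json('outdata.json')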