Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
Python variables:
Immutable variables are:
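As a rough illustration of the distinction, assuming plain Python rather than Spark: a list can be changed in place, so every name bound to it sees the change, while a tuple cannot be modified and "changing" it means building a new object. The names shared, alias, frozen, and extended are hypothetical.

```python
# Mutable: a list is changed in place, and every reference sees the change.
shared = [1, 2, 3]
alias = shared
alias.append(4)

# Immutable: a tuple cannot be modified; "changing" it creates a new object.
frozen = (1, 2, 3)
extended = frozen + (4,)

try:
    frozen[0] = 99  # in-place modification is rejected
    modified = True
except TypeError:
    modified = False
```

Because alias and shared point to the same list, both now hold four elements, while frozen still holds its original three.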
Define a new data frame:
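# Assumes voterdata.csv has a header row that includes a 'year' column
voter_df = spark.read.csv('voterdata.csv', header=True, inferSchema=True)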
Making changes:
voter_df = voter_df.withColumn('fullyear', voter_df.year + 2000)
voter_df = voter_df.drop(voter_df.year)
voter_df.count()