Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
Python variables:
Immutable variables are:
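To make the mutable/immutable distinction concrete, here is a minimal sketch in plain Python. It is only an illustration: the names `years`, `alias`, `short_years`, and `original` are hypothetical and do not come from the slides. A list stands in for a mutable variable and a tuple for an immutable one.

```python
# Mutable: a list can be changed in place, so every name bound to it sees the change.
years = [17, 19]
alias = years          # second name for the same list object
alias.append(21)       # modifies the shared object

# Immutable: a tuple cannot be changed in place; "modifying" it builds a new object.
short_years = (17, 19)
original = short_years                               # keep a reference to the first tuple
short_years = tuple(y + 2000 for y in short_years)   # rebinds the name to a new tuple

print(years)        # the list changed through its alias
print(original)     # the first tuple is untouched
print(short_years)  # the name now points at the re-created tuple
```

The same pattern appears with Spark DataFrames: a call such as `withColumn` returns a new DataFrame, and reassigning the result to the same name replaces the reference rather than editing the original.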
Define a new DataFrame:
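# Assumption: voterdata.csv has a header row with a numeric 'year' column;
# header=True and inferSchema=True let voter_df.year resolve as a number.
voter_df = spark.read.csv('voterdata.csv', header=True, inferSchema=True)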
Make modifications:
voter_df = voter_df.withColumn('fullyear',
                               voter_df.year + 2000)
voter_df = voter_df.drop(voter_df.year)
voter_df.count()