Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
Calculations using UDF
def getAvgSale(saleslist): totalsales = 0 count = 0 for sale in saleslist: totalsales += sale[2] + sale[3] count += 2 return totalsales / count
udfGetAvgSale = udf(getAvgSale, DoubleType()) df = df.withColumn('avg_sale', udfGetAvgSale(df.sales_list))
Inline calculations
df = df.read.csv('datafile')
df = df.withColumn('avg', (df.total_sales / df.sales_count))
df = df.withColumn('sq_ft', df.width * df.length)
df = df.withColumn('total_avg_size', udfComputeTotal(df.entries) / df.numEntries)
Cleaning Data with PySpark