Final analysis and delivery

Cleaning Data with PySpark

Mike Metzger

Data Engineering Consultant

Analysis calculations (UDF)

Calculations using UDF

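from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def getAvgSale(saleslist):
  # Each entry contributes two sale amounts, at positions 2 and 3
  totalsales = 0
  count = 0
  for sale in saleslist:
    totalsales += sale[2] + sale[3]
    count += 2
  return totalsales / count

# Wrap the function as a UDF that returns a double, then apply it
udfGetAvgSale = udf(getAvgSale, DoubleType())
df = df.withColumn('avg_sale', udfGetAvgSale(df.sales_list))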
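Because getAvgSale is an ordinary Python function, it can be checked on a small hand-made list before it is wrapped as a UDF. The sketch below assumes each entry is a tuple with sale amounts at positions 2 and 3; the sample rows and their other fields are hypothetical, not taken from the course data.

```python
def getAvgSale(saleslist):
    # Each entry contributes two sale amounts, at positions 2 and 3
    totalsales = 0
    count = 0
    for sale in saleslist:
        totalsales += sale[2] + sale[3]
        count += 2
    return totalsales / count

# Hypothetical sample: (store, period, first sale amount, second sale amount)
sample_sales = [
    ("store_a", "2019-01", 100.0, 200.0),
    ("store_a", "2019-02", 300.0, 400.0),
]

# Four amounts totalling 1000.0, so the average is 250.0
result = getAvgSale(sample_sales)
print(result)
```

Note that an empty list would raise ZeroDivisionError here, so rows with no sales may need to be filtered out or handled before the UDF is applied.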

Analysis calculations (inline)

Inline calculations

df = spark.read.csv('datafile')

df = df.withColumn('avg', (df.total_sales / df.sales_count))
df = df.withColumn('sq_ft', df.width * df.length)
df = df.withColumn('total_avg_size', udfComputeTotal(df.entries) / df.numEntries)

Let's practice!
