Final analysis and delivery

Cleaning Data with PySpark

Mike Metzger

Data Engineering Consultant

Analysis calculations (UDF)

Calculations using a UDF

def getAvgSale(saleslist):
  totalsales = 0
  count = 0
  for sale in saleslist:
    totalsales += sale[2] + sale[3]
    count += 2
  return totalsales / count

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

udfGetAvgSale = udf(getAvgSale, DoubleType())
df = df.withColumn('avg_sale', udfGetAvgSale(df.sales_list))
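
Because a UDF wraps an ordinary Python function, the averaging logic can be checked on plain Python data before it is registered with Spark. The sketch below is one possible way to do that, not the course's own code. The sample entries, the `add_avg_sale` helper, and the choice to return `None` (a null in the resulting column) for a missing or empty list are all assumptions made for illustration.

```python
def get_avg_sale(saleslist):
    """Average of the two sale amounts (positions 2 and 3) across all entries."""
    # Assumption: a missing or empty list should yield a null, not an error.
    if not saleslist:
        return None
    totalsales = 0.0
    count = 0
    for sale in saleslist:
        totalsales += sale[2] + sale[3]
        count += 2  # two amounts are added per entry
    return totalsales / count


# Hypothetical sample entries: (date, store, first_amount, second_amount).
sample_sales = [
    ("2023-01-01", "store_1", 10.0, 20.0),
    ("2023-01-02", "store_1", 30.0, 40.0),
]
sample_avg = get_avg_sale(sample_sales)  # (10 + 20 + 30 + 40) / 4
empty_avg = get_avg_sale([])


def add_avg_sale(df):
    """Attach an avg_sale column to a DataFrame that has a sales_list column."""
    # Imported here so the plain-Python checks above run without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    udf_get_avg_sale = udf(get_avg_sale, DoubleType())
    return df.withColumn('avg_sale', udf_get_avg_sale(df.sales_list))
```

With a DataFrame that has a `sales_list` column, `df = add_avg_sale(df)` would apply the same function row by row. The guard matters because, without it, an empty list would raise a `ZeroDivisionError` inside the UDF.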

Analysis calculations (inline)

Inline calculations

# Read through the SparkSession (assumed to be available as `spark`), not through a DataFrame
df = spark.read.csv('datafile')

df = df.withColumn('avg', (df.total_sales / df.sales_count))
df = df.withColumn('sq_ft', df.width * df.length)
df = df.withColumn('total_avg_size', udfComputeTotal(df.entries) / df.numEntries)

Let's practice!
