User defined functions

Cleaning Data with PySpark

Mike Metzger

Data Engineering Consultant

Defined...

User defined functions or UDFs

  • Python method
  • Wrapped via the pyspark.sql.functions.udf method
  • Stored as a variable
  • Called like a normal Spark function

Reverse string UDF

Define a Python method

def reverseString(mystr):
    return mystr[::-1]

Wrap the function and store as a variable

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
udfReverseString = udf(reverseString, StringType())

Use with Spark

user_df = user_df.withColumn('ReverseName', 
                 udfReverseString(user_df.Name))
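
A caveat worth sketching: Spark passes None to a Python UDF when the column value is null, and slicing None raises a TypeError. The variant below is a minimal sketch, not part of the original slides. The names reverse_string_safe and add_reverse_name are hypothetical, and the PySpark imports sit inside the helper only so the plain function can be tried without a Spark installation.

```python
def reverse_string_safe(mystr):
    # Null column values arrive as None; pass them through unchanged
    if mystr is None:
        return None
    return mystr[::-1]


def add_reverse_name(df):
    # Hypothetical helper: assumes df is a Spark DataFrame with a 'Name' column
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    udf_reverse = udf(reverse_string_safe, StringType())
    return df.withColumn('ReverseName', udf_reverse(df['Name']))
```

Usage would mirror the slide: user_df = add_reverse_name(user_df).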

Argument-less example

import random

def sortingCap():
    # Takes no arguments; returns one of four class labels at random
    return random.choice(['G', 'H', 'R', 'S'])

udfSortingCap = udf(sortingCap, StringType())
user_df = user_df.withColumn('Class', udfSortingCap())

Name     Age  Class
Alice    14   H
Bob      18   S
Candice  63   G
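
One point the random example raises: Spark treats UDFs as deterministic by default, so the optimizer may call one more or fewer times than the query suggests. For a function built on random.choice, marking the UDF with asNondeterministic() is the documented way to opt out. The sketch below is an assumption-laden illustration, not the slides' own code. HOUSES, sorting_cap, and add_class_column are hypothetical names, and the PySpark imports are kept inside the helper so the plain function runs without Spark.

```python
import random

HOUSES = ['G', 'H', 'R', 'S']  # same labels as the slide example


def sorting_cap():
    # No arguments: each call independently picks one label
    return random.choice(HOUSES)


def add_class_column(df):
    # Hypothetical helper: assumes df is a Spark DataFrame
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Mark the UDF as nondeterministic so Spark does not assume
    # repeated calls return the same value
    udf_sorting_cap = udf(sorting_cap, StringType()).asNondeterministic()
    return df.withColumn('Class', udf_sorting_cap())
```

Because the labels are random, rerunning the job will generally produce a different Class column than the table shown above.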

Let's practice!
