Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
User defined functions or UDFs
pyspark.sql.functions.udf method
Define a Python method
def reverseString(mystr):
    return mystr[::-1]
Wrap the function and store as a variable
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
udfReverseString = udf(reverseString, StringType())
Use with Spark
user_df = user_df.withColumn('ReverseName',
                             udfReverseString(user_df.Name))
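The three steps above can be combined into one script. This is a minimal sketch, assuming a local SparkSession and a small hypothetical `user_df` with `Name` and `Age` columns. The `None` check is an addition that is not on the slide: Spark passes SQL nulls to a Python UDF as `None`, and slicing `None` would raise a `TypeError`. The helper names `reverse_string`, `add_reverse_name`, and `run_demo` are illustrative.

```python
def reverse_string(mystr):
    """Plain Python logic: reverse a string, passing nulls through."""
    if mystr is None:
        return None
    return mystr[::-1]


def add_reverse_name(df):
    """Wrap reverse_string as a UDF and add a 'ReverseName' column to df."""
    # Imported here so reverse_string can be unit-tested without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    udf_reverse_string = udf(reverse_string, StringType())
    return df.withColumn('ReverseName', udf_reverse_string(df['Name']))


def run_demo():
    """Build a tiny DataFrame (hypothetical rows) and show the new column."""
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master('local[1]')
             .appName('udf-demo')
             .getOrCreate())
    user_df = spark.createDataFrame(
        [('Alice', 14), ('Bob', 18), (None, 63)], ['Name', 'Age'])
    add_reverse_name(user_df).show()
    spark.stop()
```

Calling `run_demo()` in an environment with PySpark installed prints the DataFrame with the added `ReverseName` column. Keeping the logic in a plain function means it can be checked with ordinary Python tests before it is wrapped as a UDF.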
import random

def sortingCap():
    return random.choice(['G', 'H', 'R', 'S'])

udfSortingCap = udf(sortingCap, StringType())
user_df = user_df.withColumn('Class', udfSortingCap())
Name | Age | Class |
---|---|---|
Alice | 14 | H |
Bob | 18 | S |
Candice | 63 | G |
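The argument-less example above can be sketched end to end as follows. Because the function returns a random value, a row may get a different class each time the column is evaluated, so the table above is only one possible outcome. Spark treats UDFs as deterministic by default; `asNondeterministic()` (available on UDFs since Spark 2.3) tells the optimizer not to assume that repeated calls give the same result. That call is an addition and does not appear on the slide, and the names `CLASS_LABELS`, `sorting_cap`, `add_class`, and `run_demo` are illustrative.

```python
import random

CLASS_LABELS = ['G', 'H', 'R', 'S']


def sorting_cap():
    """Return one class label chosen at random; takes no arguments."""
    return random.choice(CLASS_LABELS)


def add_class(df):
    """Add a 'Class' column filled by the argument-less UDF."""
    # Imported here so sorting_cap can be tested without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Mark the UDF as nondeterministic because its result is random.
    udf_sorting_cap = udf(sorting_cap, StringType()).asNondeterministic()
    return df.withColumn('Class', udf_sorting_cap())


def run_demo():
    """Apply the UDF to the three users from the table above."""
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master('local[1]')
             .appName('udf-sorting-demo')
             .getOrCreate())
    user_df = spark.createDataFrame(
        [('Alice', 14), ('Bob', 18), ('Candice', 63)], ['Name', 'Age'])
    add_class(user_df).show()
    spark.stop()
```

If the assigned classes need to stay fixed across later actions, one option is to cache or write out the DataFrame after adding the column.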