U define it? U use it!

Introduction to PySpark

Benjamin Schmidt

Data Engineer

UDFs for repeatable tasks

UDF (User-Defined Function): a custom function for working with data in PySpark DataFrames

Advantages of UDFs:

  • Reuse and repeat common tasks
  • Registered directly with Spark and can be shared

Two kinds covered here:

  • PySpark UDFs (for smaller datasets)
  • pandas UDFs (for larger datasets)
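To make the "reuse" advantage concrete, one plain Python function can back a single UDF that is applied to many columns. This is a minimal sketch: the DataFrame `df` and its column names are hypothetical, and the Spark-specific lines are commented out so the plain-Python part runs on its own:

```python
# A reusable cleaning task: trim whitespace and lowercase a string
def clean_text(s):
    return s.strip().lower() if s else None

# With a SparkSession available, the same function backs one shared UDF:
# from pyspark.sql.functions import udf
# from pyspark.sql.types import StringType
# clean_text_udf = udf(clean_text, StringType())
# for col_name in ["name", "city", "country"]:  # hypothetical columns
#     df = df.withColumn(col_name, clean_text_udf(df[col_name]))

print(clean_text("  Ada LOVELACE "))  # the plain function is testable on its own
```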

Defining and registering a UDF

All PySpark UDFs need to be registered via the udf() function.

# Imports needed for registration
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Define the function
def to_uppercase(s):
    return s.upper() if s else None

# Register the function
to_uppercase_udf = udf(to_uppercase, StringType())

# Apply the UDF to the DataFrame
df = df.withColumn("name_upper", to_uppercase_udf(df["name"]))

# See the results
df.show()

Remember: UDFs allow you to apply custom Python logic on PySpark DataFrames
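The `if s else None` guard in `to_uppercase` matters: Spark hands SQL NULL values to a Python UDF as `None`, and calling `.upper()` on `None` would raise a `TypeError`. The plain function can be checked without a SparkSession:

```python
def to_uppercase(s):
    # Guard against None: Spark passes NULL values to the UDF as None
    return s.upper() if s else None

print(to_uppercase("ada"))   # ADA
print(to_uppercase(None))    # None
```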


pandas UDF

  • Avoids costly per-row conversion of data between Spark and Python (batches are exchanged via Apache Arrow)
  • Does not need to be registered to the SparkSession
  • Uses pandas capabilities on extremely large datasets
from pyspark.sql.functions import pandas_udf

@pandas_udf("float")
def fahrenheit_to_celsius_pandas(temp_f):
    # temp_f arrives as a pandas Series; the arithmetic is vectorized
    return (temp_f - 32) * 5.0 / 9.0
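Unlike a row-at-a-time UDF, `fahrenheit_to_celsius_pandas` receives a whole pandas Series per call, and the subtraction and division apply to every element at once. The conversion logic can be checked directly on a Series without a SparkSession (assumes pandas is installed):

```python
import pandas as pd

def fahrenheit_to_celsius(temp_f):
    # Same arithmetic as the pandas UDF body; works on a whole Series at once
    return (temp_f - 32) * 5.0 / 9.0

temps = pd.Series([32.0, 212.0, 98.6])
print(fahrenheit_to_celsius(temps).tolist())
```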

PySpark UDFs vs. pandas UDFs

PySpark UDF

  • Best for relatively small datasets
  • Simple transformations like data cleaning
  • Operates one row at a time, with per-value conversion overhead
  • Must be wrapped with udf() (and given a return type) before use

pandas UDF

  • Relatively large datasets
  • Complex operations beyond simple data cleaning
  • Vectorized: operates on whole batches of rows as pandas Series
  • Declared with the @pandas_udf decorator; no separate udf() call needed
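The registration difference can be sketched side by side: both kinds wrap plain Python, and only the wrapper differs. The tax rate and function names below are hypothetical, and the Spark wrappers are commented out so the underlying functions run standalone:

```python
# Row-at-a-time: Spark would call this once per value
def add_tax(price):
    # hypothetical 8% tax rate, chosen for illustration
    return price * 1.08 if price is not None else None

# Vectorized: as a pandas UDF, Spark would call this once per batch (a Series)
def add_tax_vectorized(prices):
    return prices * 1.08

# With PySpark available:
# from pyspark.sql.functions import udf, pandas_udf
# from pyspark.sql.types import DoubleType
# add_tax_udf = udf(add_tax, DoubleType())                   # explicit udf() wrapper
# add_tax_pandas = pandas_udf("double")(add_tax_vectorized)  # decorator form
```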

Let's practice!
