U define it? U use it!

Introduction to PySpark

Benjamin Schmidt

Data Engineer

UDFs for repeatable tasks

UDF (User-Defined Function): a custom function for working with data in PySpark DataFrames

Advantages of UDFs:

  • Reuse and repeat common tasks
  • Registered directly with Spark and can be shared

Two kinds covered here:

  • PySpark UDFs (for smaller datasets)
  • pandas UDFs (for larger datasets)
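To make the "reuse" advantage concrete, one plain Python function can back a single UDF that is applied to many columns. This is a minimal sketch: the DataFrame `df` and its column names are hypothetical, and the Spark-specific lines are commented out so the plain-Python part runs on its own:

```python
# A reusable cleaning task: trim whitespace and lowercase a string
def clean_text(s):
    return s.strip().lower() if s else None

# With a SparkSession available, the same function backs one shared UDF:
# from pyspark.sql.functions import udf
# from pyspark.sql.types import StringType
# clean_text_udf = udf(clean_text, StringType())
# for col_name in ["name", "city", "country"]:  # hypothetical columns
#     df = df.withColumn(col_name, clean_text_udf(df[col_name]))

print(clean_text("  Ada LOVELACE "))  # the plain function is testable on its own
```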

Defining and registering a UDF

All PySpark UDFs need to be registered via the udf() function.

# Imports needed for registration
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Define the function
def to_uppercase(s):
    return s.upper() if s else None

# Register the function
to_uppercase_udf = udf(to_uppercase, StringType())

# Apply the UDF to the DataFrame
df = df.withColumn("name_upper", to_uppercase_udf(df["name"]))

# See the results
df.show()

Remember: UDFs allow you to apply custom Python logic on PySpark DataFrames
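The `if s else None` guard in `to_uppercase` matters: Spark hands SQL NULL values to a Python UDF as `None`, and calling `.upper()` on `None` would raise a `TypeError`. The plain function can be checked without a SparkSession:

```python
def to_uppercase(s):
    # Guard against None: Spark passes NULL values to the UDF as None
    return s.upper() if s else None

print(to_uppercase("ada"))   # ADA
print(to_uppercase(None))    # None
```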


pandas UDF

  • Avoids costly per-row conversion of data between Spark and Python (batches are exchanged via Apache Arrow)
  • Does not need to be registered to the SparkSession
  • Uses pandas capabilities on extremely large datasets
from pyspark.sql.functions import pandas_udf

@pandas_udf("float")
def fahrenheit_to_celsius_pandas(temp_f):
    # temp_f arrives as a pandas Series; the arithmetic is vectorized
    return (temp_f - 32) * 5.0 / 9.0
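Unlike a row-at-a-time UDF, `fahrenheit_to_celsius_pandas` receives a whole pandas Series per call, and the subtraction and division apply to every element at once. The conversion logic can be checked directly on a Series without a SparkSession (assumes pandas is installed):

```python
import pandas as pd

def fahrenheit_to_celsius(temp_f):
    # Same arithmetic as the pandas UDF body; works on a whole Series at once
    return (temp_f - 32) * 5.0 / 9.0

temps = pd.Series([32.0, 212.0, 98.6])
print(fahrenheit_to_celsius(temps).tolist())
```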

PySpark UDFs vs. pandas UDFs

PySpark UDF

  • Best for relatively small datasets
  • Simple transformations like data cleaning
  • Operates one row at a time, with per-value conversion overhead
  • Must be wrapped with udf() (and given a return type) before use

pandas UDF

  • Relatively large datasets
  • Complex operations beyond simple data cleaning
  • Vectorized: operates on whole batches of rows as pandas Series
  • Declared with the @pandas_udf decorator; no separate udf() call needed
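The registration difference can be sketched side by side: both kinds wrap plain Python, and only the wrapper differs. The tax rate and function names below are hypothetical, and the Spark wrappers are commented out so the underlying functions run standalone:

```python
# Row-at-a-time: Spark would call this once per value
def add_tax(price):
    # hypothetical 8% tax rate, chosen for illustration
    return price * 1.08 if price is not None else None

# Vectorized: as a pandas UDF, Spark would call this once per batch (a Series)
def add_tax_vectorized(prices):
    return prices * 1.08

# With PySpark available:
# from pyspark.sql.functions import udf, pandas_udf
# from pyspark.sql.types import DoubleType
# add_tax_udf = udf(add_tax, DoubleType())                   # explicit udf() wrapper
# add_tax_pandas = pandas_udf("double")(add_tax_vectorized)  # decorator form
```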

Let's practice!
