Immutability and Lazy Processing

Cleaning Data with PySpark

Mike Metzger

Data Engineering Consultant

Variable review

Python variables:

  • Mutable by default
  • Flexible to work with
  • Potential for concurrency issues
  • Mutability adds complexity
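A minimal pure-Python sketch of the mutability hazard: two names can refer to the same list, so a change made through one name is visible through the other.

```python
# Two names bound to the same mutable list
years = [16, 17, 18]
alias = years          # no copy is made; both names share one object

alias.append(19)       # mutate through one name...
print(years)           # ...and the change shows through the other: [16, 17, 18, 19]
```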

Immutability

Immutable variables are:

  • A component of functional programming
  • Defined once
  • Unable to be directly modified
  • Re-created if reassigned
  • Able to be shared efficiently
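Python's own immutable types illustrate the "defined once, re-created if reassigned" behavior: "modifying" a tuple actually binds the name to a brand-new object, leaving the original value untouched.

```python
row = (2016, 'May')
original_id = id(row)

# "Modifying" an immutable value really builds a new one
row = row + ('Tuesday',)
print(row)                      # (2016, 'May', 'Tuesday')
print(id(row) == original_id)   # False: the name now points at a new object
```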

Immutability Example

Define a new data frame:

voter_df = spark.read.csv('voterdata.csv')

Making changes:

# withColumn() returns a new DataFrame; the name voter_df is rebound to it
voter_df = voter_df.withColumn('fullyear', 
    voter_df.year + 2000)

# drop() also returns a new DataFrame, now without the original year column
voter_df = voter_df.drop(voter_df.year)

Lazy Processing

  • Isn't re-creating a DataFrame on every change slow?
  • Transformations (e.g., .withColumn(), .drop()) only update the execution plan
  • Actions (e.g., .count()) trigger the actual processing
  • Deferring work this way allows Spark to plan the most efficient set of operations
# Transformations: nothing is executed yet, only the plan is updated
voter_df = voter_df.withColumn('fullyear', 
    voter_df.year + 2000)
voter_df = voter_df.drop(voter_df.year)

# Action: triggers Spark to actually perform the planned work
voter_df.count()
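As a rough pure-Python analogy (not Spark itself), generators show the same lazy behavior: building the pipeline does no work, and elements are only processed once something consumes the result.

```python
log = []

def add_2000(values):
    """Lazily add 2000 to each value, recording when work actually happens."""
    for v in values:
        log.append(v)       # runs only when the pipeline is consumed
        yield v + 2000

years = [16, 17, 18]
pipeline = add_2000(years)  # "transformation": nothing has run yet
print(log)                  # []

total = sum(pipeline)       # "action": now every element is processed
print(log)                  # [16, 17, 18]
print(total)                # 6051
```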

Let's practice!
