Cluster sizing tips

Cleaning Data with PySpark

Mike Metzger

Data Engineering Consultant

Configuration options

  • Spark contains many configuration settings
  • These can be modified to match the needs of your application
  • Reading configuration settings:
    spark.conf.get(<configuration name>)
    
  • Writing configuration settings:
    spark.conf.set(<configuration name>, <value>)
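
    A minimal sketch of reading and then updating the shuffle partition
    count (a real Spark setting; the value 500 is just an illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the current number of shuffle partitions
    print(spark.conf.get("spark.sql.shuffle.partitions"))

    # Update the setting for this session; set() takes a name and a value
    spark.conf.set("spark.sql.shuffle.partitions", "500")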
    

Cluster Types

Spark deployment options:

  • Single node
  • Standalone
  • Managed
    • YARN
    • Mesos
    • Kubernetes
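
The deployment type corresponds to the master URL used when creating a
SparkSession. A minimal sketch (the hostnames are placeholders):

    from pyspark.sql import SparkSession

    # Single node: run Spark locally, using all available cores
    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Standalone: connect to a Spark standalone cluster manager
    # spark = SparkSession.builder.master("spark://master-host:7077").getOrCreate()

    # Managed: hand resource management to YARN (Mesos and Kubernetes
    # use mesos:// and k8s:// master URLs instead)
    # spark = SparkSession.builder.master("yarn").getOrCreate()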

Driver

  • Task assignment
  • Result consolidation
  • Shared data access

Tips:

  • Driver node should have double the memory of each worker
  • Fast local storage helpful
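
One way to apply the memory tip is at session startup. A minimal
sketch, assuming an 8 GB driver as a placeholder value (driver memory
must be set before the session launches):

    from pyspark.sql import SparkSession

    # Size the driver to roughly double the per-worker memory
    spark = (SparkSession.builder
             .appName("driver-sizing")
             .config("spark.driver.memory", "8g")
             .getOrCreate())

    # Confirm the setting on the running session
    print(spark.conf.get("spark.driver.memory"))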

Worker

  • Runs actual tasks
  • Ideally has all code, data, and resources for a given task

Recommendations:

  • More worker nodes are often better than larger workers
  • Test to find the balance
  • Fast local storage extremely useful
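
A minimal sketch of sizing executors toward "more, smaller workers"
(all values are placeholders to tune for your workload):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("worker-sizing")
             .config("spark.executor.instances", "8")  # number of executors
             .config("spark.executor.memory", "4g")    # memory per executor
             .config("spark.executor.cores", "2")      # cores per executor
             .getOrCreate())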

Let's practice!
