Cluster sizing tips

Cleaning Data with PySpark

Mike Metzger

Data Engineering Consultant

Configuration options

  • Spark contains many configuration settings
  • These can be modified to match the needs of your application
  • Reading configuration settings:
    spark.conf.get(<configuration name>)
    
  • Writing configuration settings:
    spark.conf.set(<configuration name>, <value>)
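
    A minimal sketch of reading and then updating the shuffle partition
    count (a real Spark setting; the value 500 is just an illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the current number of shuffle partitions
    print(spark.conf.get("spark.sql.shuffle.partitions"))

    # Update the setting for this session; set() takes a name and a value
    spark.conf.set("spark.sql.shuffle.partitions", "500")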
    

Cluster Types

Spark deployment options:

  • Single node
  • Standalone
  • Managed
    • YARN
    • Mesos
    • Kubernetes
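
The deployment type corresponds to the master URL used when creating a
SparkSession. A minimal sketch (the hostnames are placeholders):

    from pyspark.sql import SparkSession

    # Single node: run Spark locally, using all available cores
    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Standalone: connect to a Spark standalone cluster manager
    # spark = SparkSession.builder.master("spark://master-host:7077").getOrCreate()

    # Managed: hand resource management to YARN (Mesos and Kubernetes
    # use mesos:// and k8s:// master URLs instead)
    # spark = SparkSession.builder.master("yarn").getOrCreate()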

Driver

  • Task assignment
  • Result consolidation
  • Shared data access

Tips:

  • Driver node should have double the memory of each worker
  • Fast local storage helpful
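
One way to apply the memory tip is at session startup. A minimal
sketch, assuming an 8 GB driver as a placeholder value (driver memory
must be set before the session launches):

    from pyspark.sql import SparkSession

    # Size the driver to roughly double the per-worker memory
    spark = (SparkSession.builder
             .appName("driver-sizing")
             .config("spark.driver.memory", "8g")
             .getOrCreate())

    # Confirm the setting on the running session
    print(spark.conf.get("spark.driver.memory"))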

Worker

  • Runs actual tasks
  • Ideally has all code, data, and resources for a given task

Recommendations:

  • More worker nodes are often better than larger workers
  • Test to find the balance
  • Fast local storage extremely useful
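
A minimal sketch of sizing executors toward "more, smaller workers"
(all values are placeholders to tune for your workload):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("worker-sizing")
             .config("spark.executor.instances", "8")  # number of executors
             .config("spark.executor.memory", "4g")    # memory per executor
             .config("spark.executor.cores", "2")      # cores per executor
             .getOrCreate())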

Let's practice!
