Machine Learning & Spark

Machine Learning with PySpark

Andrew Collier

Data Scientist, Fathom Data

Building the perfect waffle (an analogy)

A single waffle.

Find waffle recipe. Give explicit instructions:

  • 125 g flour
  • 1 t baking powder
  • 1 egg
  • 225 ml milk
  • 1 T melted butter

A collection of waffles.

Find many waffle recipes.

Learn the perfect recipe:

  1. Look at lots of recipes.
  2. What ingredients?
  3. What proportions?

Computer generates its own instructions.

Machine Learning with PySpark

A plot of flour versus sugar for a regression model. A plot of salt versus sugar for a classification problem.

Machine Learning with PySpark

Data in RAM

When data are small the entire problem can fit in RAM.

Machine Learning with PySpark

Data exceeds RAM

When data too big for RAM it's swapped to disk.

Machine Learning with PySpark

Data distributed across a cluster

For very large data it makes sense to distribute the data across multiple computers.

Machine Learning with PySpark

What is Spark?

The Spark logo.

  • Compute across a distributed cluster.
  • Data processed in memory.
  • Well documented high-level API.
Machine Learning with PySpark

A collection of nodes in a cluster.

Machine Learning with PySpark

A collection of nodes in a cluster with a cluster manager.

Machine Learning with PySpark

A collection of nodes in a cluster with a cluster manager and driver.

Machine Learning with PySpark

Executors on each node in the cluster.

Machine Learning with PySpark

Onward!

Machine Learning with PySpark

Preparing Video For Download...