Connecting to Spark

Machine Learning with PySpark

Andrew Collier

Data Scientist, Fathom Data

Interacting with Spark

Logos for Java, Scala, Python and R.

Languages for interacting with Spark.

  • Java — low-level, compiled
  • Scala, Python and R — high-level with interactive REPL

Importing pyspark

In Python, import the pyspark module.

import pyspark

Check version of the pyspark module.

pyspark.__version__
'2.4.1'

Sub-modules

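In addition to pyspark there are sub-modules for specific tasks, such as pyspark.sql (structured data), pyspark.streaming (streaming data) and pyspark.ml (machine learning).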

Spark URL

Remote Cluster using Spark URL — spark://<IP address | DNS name>:<port>

Example:

  • spark://13.59.151.161:7077
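
A Spark URL of the form above can be assembled from its parts. The following is a minimal sketch: the helper name spark_url is hypothetical (it is not part of pyspark), and the default of 7077 assumes the standard port of a standalone Spark master.

```python
def spark_url(host: str, port: int = 7077) -> str:
    """Return a master URL of the form spark://<host>:<port>.

    `host` is an IP address or DNS name. This helper is illustrative only.
    """
    if not host:
        raise ValueError("host must be a non-empty IP address or DNS name")
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"spark://{host}:{port}"


# Reproduces the example URL from the slide.
print(spark_url("13.59.151.161"))
```

The resulting string is what would be passed to the session builder's master() setting when connecting to a remote cluster.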

Local Cluster

Examples:

  • local — only 1 core;
  • local[4] — 4 cores; or
  • local[*] — all available cores.
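
The three local forms listed above can be generated from a requested core count. This is a small sketch with a hypothetical helper, local_master, which is not part of pyspark; it only builds the string.

```python
from typing import Optional


def local_master(cores: Optional[int] = None) -> str:
    """Build a local master string.

    None -> 'local[*]' (all available cores)
    1    -> 'local'    (a single core)
    N    -> 'local[N]' (N cores)
    """
    if cores is None:
        return "local[*]"
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    return "local" if cores == 1 else f"local[{cores}]"


print(local_master())   # all available cores
print(local_master(1))  # only 1 core
print(local_master(4))  # 4 cores
```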

Creating a SparkSession

from pyspark.sql import SparkSession

Create a local cluster using a SparkSession builder.

spark = SparkSession.builder \
                    .master('local[*]') \
                    .appName('first_spark_application') \
                    .getOrCreate()

Interact with Spark...

# Close connection to Spark
spark.stop()
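
One way to make sure the connection is always closed, even if the work in between fails, is to wrap the create and stop steps in a context manager. This is a sketch rather than a pyspark feature: the name spark_session is hypothetical, and pyspark is imported inside the function so that the definition itself does not require a Spark installation.

```python
from contextlib import contextmanager


@contextmanager
def spark_session(master="local[*]", app_name="first_spark_application"):
    """Yield a SparkSession and stop it when the with-block exits."""
    # Imported lazily: pyspark is only needed once a session is requested.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master(master)
        .appName(app_name)
        .getOrCreate()
    )
    try:
        yield spark
    finally:
        # Close connection to Spark, whether or not the block raised.
        spark.stop()
```

With pyspark and a Java runtime available, this would be used as `with spark_session() as spark:` followed by the code that interacts with Spark.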

Let's connect to Spark!