Machine Learning with PySpark
Andrew Collier
Data Scientist, Fathom Data
Languages for interacting with Spark.
From Python, import the pyspark module.
import pyspark
Check the version of the pyspark module.
pyspark.__version__
'2.4.1'
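A script that relies on a particular release can compare this string against a minimum version. The following is only a minimal sketch: the helper name version_tuple and the (2, 4) threshold are illustrative assumptions, and the literal '2.4.1' stands in for pyspark.__version__ so the snippet runs without Spark installed.

```python
def version_tuple(version):
    """Turn a version string such as '2.4.1' into (2, 4) for comparison."""
    major, minor = version.split('.')[:2]
    return int(major), int(minor)


# Stand-in for pyspark.__version__ (assumption: same 'major.minor.patch' shape).
installed = '2.4.1'

# Tuples compare element by element, so (2, 4) >= (2, 4) is True.
print(version_tuple(installed) >= (2, 4))
```

Comparing tuples of integers avoids the pitfall of comparing version strings directly, where '2.10' would sort before '2.4'.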
In addition to pyspark, there are submodules for:
Structured data: pyspark.sql
Streaming data: pyspark.streaming
Machine learning: pyspark.mllib (deprecated) and pyspark.ml
Remote Cluster using Spark URL: spark://<IP address | DNS name>:<port>
Example:
spark://13.59.151.161:7077
Local Cluster
Examples:
local - only 1 core;
local[4] - 4 cores; or
local[*] - all available cores.
from pyspark.sql import SparkSession
Create a local cluster using a SparkSession builder.
spark = SparkSession.builder \
    .master('local[*]') \
    .appName('first_spark_application') \
    .getOrCreate()
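The value passed to .master() follows one of the forms listed earlier: a spark:// URL for a remote cluster, or local, local[N] and local[*] for a local one. As a rough illustration of how those forms differ, the sketch below classifies a master string. The function describe_master is hypothetical and not part of pyspark, and it only recognises the forms shown in these slides; Spark accepts other master values that are not covered here.

```python
import re


def describe_master(master):
    """Describe a Spark master string (only the forms shown in the slides)."""
    if master == 'local':
        return 'local, 1 core'

    # local[4] or local[*]
    match = re.fullmatch(r'local\[(\*|\d+)\]', master)
    if match:
        cores = match.group(1)
        if cores == '*':
            return 'local, all available cores'
        return f'local, {cores} cores'

    # spark://<IP address | DNS name>:<port>
    match = re.fullmatch(r'spark://([^:/\s]+):(\d+)', master)
    if match:
        host, port = match.groups()
        return f'remote cluster at {host}, port {port}'

    raise ValueError(f'unrecognised master string: {master!r}')


for example in ['local', 'local[4]', 'local[*]', 'spark://13.59.151.161:7077']:
    print(example, '->', describe_master(example))
```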
Interact with Spark...
# Close connection to Spark
spark.stop()
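The create, interact, stop sequence can be wrapped so that the connection is closed even if the work in the middle raises an exception. This is one possible pattern, not something the slides prescribe: the names local_session and run_with_spark are hypothetical, and pyspark is imported lazily inside local_session so the helper itself can be defined without Spark being installed.

```python
def local_session(app_name='first_spark_application'):
    """Create a local SparkSession using all available cores."""
    # Imported here so that defining these helpers does not require pyspark.
    from pyspark.sql import SparkSession

    return SparkSession.builder \
        .master('local[*]') \
        .appName(app_name) \
        .getOrCreate()


def run_with_spark(work, create_session=local_session):
    """Call work(spark) and always stop the session afterwards."""
    spark = create_session()
    try:
        return work(spark)
    finally:
        # Close connection to Spark, even if work() raised.
        spark.stop()
```

With pyspark and a Java runtime available, a call such as run_with_spark(lambda spark: spark.range(5).count()) would create the session, count the rows of a five-row DataFrame, and stop the session before returning. Passing a different create_session makes it possible to point the same code at a remote cluster.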