PySpark: Spark with Python

Big Data Fundamentals with PySpark

Upendra Devisetty

Science Analyst, CyVerse

Overview of PySpark

  • Apache Spark is written in Scala

  • To support Python with Spark, the Apache Spark community released PySpark

  • PySpark offers computation speed and power similar to Scala

  • PySpark APIs are similar to those of pandas and scikit-learn

What is Spark shell?

  • Interactive environment for running Spark jobs

  • Helpful for fast interactive prototyping

  • Spark shells allow interacting with data on disk or in memory

  • Three different Spark shells:

    • Spark shell (spark-shell) for Scala

    • PySpark shell (pyspark) for Python

    • SparkR shell (sparkR) for R

PySpark shell

  • The PySpark shell is a Python-based command-line tool

  • It allows data scientists to interface with Spark data structures

  • It supports connecting to a cluster

Understanding SparkContext

  • SparkContext is an entry point into the world of Spark

  • An entry point is a way of connecting to a Spark cluster

  • An entry point is like a key to the house

  • The PySpark shell provides a default SparkContext called sc

1 https://www.datacamp.com/cheat-sheet/pyspark-cheat-sheet-spark-in-python

Inspecting SparkContext

  • Version: the version of Spark that the SparkContext is running
sc.version
2.3.1
  • Python version: the version of Python that the SparkContext is using
sc.pythonVer
3.6
  • Master: the URL of the cluster the SparkContext is connected to, or a local string when running in local mode
sc.master
local[*]

Loading data in PySpark

  • SparkContext's parallelize() method creates an RDD from a Python collection
rdd = sc.parallelize([1,2,3,4,5])
  • SparkContext's textFile() method creates an RDD from a text file, one element per line
rdd2 = sc.textFile("test.txt")

Let's practice
