Big Data Fundamentals with PySpark
Upendra Devisetty
Science Analyst, CyVerse
Apache Spark is written in Scala
To support Python with Spark, the Apache Spark community released PySpark
Similar computation speed and power to Scala
PySpark APIs are similar to those of pandas and scikit-learn
Interactive environment for running Spark jobs
Helpful for fast interactive prototyping
Spark shells allow interacting with data on disk or in memory
Three different Spark shells:
Spark-shell for Scala
PySpark-shell for Python
SparkR for R
PySpark shell is the Python-based command line tool
PySpark shell allows data scientists to interface with Spark data structures
PySpark shell supports connecting to a cluster
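For example, assuming Spark's bin directory is on your PATH, each shell is launched from the command line (the --master option below is only an illustration):
spark-shell                  # Scala shell
pyspark                      # Python shell
pyspark --master local[4]    # e.g. run locally with 4 worker threads
sparkR                       # R shell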
SparkContext is an entry point into the world of Spark
An entry point is a way of connecting to a Spark cluster
An entry point is like a key to the house
PySpark has a default SparkContext called sc
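In the PySpark shell, sc is created automatically; in a standalone script you create your own. A minimal sketch (the application name "MyApp" is a hypothetical example):
from pyspark import SparkContext
# Connect to a local cluster using all available cores;
# appName is an arbitrary label ("MyApp" is hypothetical)
sc = SparkContext(master="local[*]", appName="MyApp")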
sc.version      # version of Spark running on the cluster: 2.3.1
sc.pythonVer    # version of Python being used: 3.6
sc.master       # URL of the cluster or "local" string: local[*]
parallelize() method:
rdd = sc.parallelize([1, 2, 3, 4, 5])
textFile() method:
rdd2 = sc.textFile("test.txt")
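A short sketch putting both methods together (it assumes a file test.txt exists in the working directory):
# Create an RDD from an in-memory Python list
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.collect())    # [1, 2, 3, 4, 5]
# Create an RDD from a text file, one element per line
rdd2 = sc.textFile("test.txt")
print(rdd2.first())     # first line of test.txt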