Big Data Fundamentals with PySpark
Upendra Devisetty
Science Analyst, CyVerse
Apache Spark is written in Scala
To support Python with Spark, the Apache Spark community released PySpark
Similar computation speed and power to Scala
PySpark APIs are similar to those of pandas and scikit-learn
Interactive environment for running Spark jobs
Helpful for fast interactive prototyping
Spark shells allow interacting with data on disk or in memory
Three different Spark shells:
Spark-shell for Scala
PySpark-shell for Python
SparkR for R
PySpark shell is the Python-based command line tool
PySpark shell allows data scientists to interface with Spark data structures
PySpark shell supports connecting to a cluster
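For example, assuming Spark's bin directory is on your PATH, each shell is launched from the command line (the --master option below is only an illustration):
spark-shell                  # Scala shell
pyspark                      # Python shell
pyspark --master local[4]    # e.g. run locally with 4 worker threads
sparkR                       # R shell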
SparkContext is an entry point into the world of Spark
An entry point is a way of connecting to a Spark cluster
An entry point is like a key to the house
PySpark has a default SparkContext called sc
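In the PySpark shell, sc is created automatically; in a standalone script you create your own. A minimal sketch (the application name "MyApp" is a hypothetical example):
from pyspark import SparkContext
# Connect to a local cluster using all available cores;
# appName is an arbitrary label ("MyApp" is hypothetical)
sc = SparkContext(master="local[*]", appName="MyApp")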
sc.version      # version of Spark running on the cluster: 2.3.1
sc.pythonVer    # version of Python being used: 3.6
sc.master       # URL of the cluster or "local" string: local[*]
parallelize() method:
rdd = sc.parallelize([1, 2, 3, 4, 5])
textFile() method:
rdd2 = sc.textFile("test.txt")
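A short sketch putting both methods together (it assumes a file test.txt exists in the working directory):
# Create an RDD from an in-memory Python list
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.collect())    # [1, 2, 3, 4, 5]
# Create an RDD from a text file, one element per line
rdd2 = sc.textFile("test.txt")
print(rdd2.first())     # first line of test.txt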