Big Data Fundamentals with PySpark
Upendra Devisetty
Science Analyst, CyVerse
PySpark SQL is a Spark library for structured data. It provides Spark with more information about the structure of the data and the computation being performed
PySpark DataFrame is an immutable distributed collection of data with named columns
Designed for processing both structured data (e.g., relational databases) and semi-structured data (e.g., JSON)
The DataFrame API is available in Python, R, Scala, and Java
DataFrames in PySpark support both SQL queries (SELECT * FROM table) and expression methods (df.select()), as sketched below
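A minimal sketch of the two styles, assuming a DataFrame df with a "name" column already exists and spark is an active SparkSession:
# Expression-method style
df.select("name").show()
# SQL style: register the DataFrame as a temporary view first, then query it
df.createOrReplaceTempView("table")
spark.sql("SELECT * FROM table").show()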
SparkContext is the main entry point for creating RDDs
SparkSession provides a single point of entry to interact with Spark DataFrames
SparkSession is used to create DataFrame, register DataFrames, execute SQL queries
SparkSession is available in PySpark shell as spark
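Outside the PySpark shell, a SparkSession can be created explicitly; a minimal sketch (the application name here is just an example):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("BigDataFundamentals") \
    .getOrCreate()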
Two different methods of creating DataFrames in PySpark
From existing RDDs using SparkSession's createDataFrame() method
From various data sources (CSV, JSON, TXT) using SparkSession's read method
Schema defines the structure of the data and helps DataFrames optimize queries
Schema provides information such as column names, the data type of each column, and whether empty (null) values are allowed (see the sketch below)
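An explicit schema can also be passed to createDataFrame(); a sketch using the iPhone columns from the example that follows (the chosen types are illustrative assumptions):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

iphones_schema = StructType([
    StructField("Model", StringType(), True),    # column name, data type, nullable
    StructField("Year", IntegerType(), True),
    StructField("Height", DoubleType(), True),
    StructField("Width", DoubleType(), True),
    StructField("Weight", DoubleType(), True)
])
# The explicit schema could then be passed as:
# iphones_df = spark.createDataFrame(iphones_RDD, schema=iphones_schema)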
# Create an RDD of tuples from in-memory data
iphones_RDD = sc.parallelize([
    ("XS", 2018, 5.65, 2.79, 6.24),
    ("XR", 2018, 5.94, 2.98, 6.84),
    ("X10", 2017, 5.65, 2.79, 6.13),
    ("8Plus", 2017, 6.23, 3.07, 7.12)
])

# Column names used as a simple schema
names = ['Model', 'Year', 'Height', 'Width', 'Weight']

# Convert the RDD into a DataFrame with named columns
iphones_df = spark.createDataFrame(iphones_RDD, schema=names)

type(iphones_df)
pyspark.sql.dataframe.DataFrame
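The resulting DataFrame can then be inspected; a small usage sketch:
iphones_df.printSchema()   # column names and types
iphones_df.show()          # display the rows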
# Read a CSV file, using the first row as a header and inferring column types
df_csv = spark.read.csv("people.csv", header=True, inferSchema=True)
# Read a JSON file
df_json = spark.read.json("people.json")
# Read a plain-text file; each line becomes one row
df_txt = spark.read.text("people.txt")
These read methods take the path to the file and two optional parameters: header=True and inferSchema=True
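A small sketch to check what the optional parameters produced, reusing the df_csv DataFrame from above:
df_csv.printSchema()   # column types inferred from the data (inferSchema=True)
df_csv.show(5)         # first 5 rows, with column names taken from the header row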