Big Data Fundamentals with PySpark
Upendra Devisetty
Science Analyst, CyVerse
PySpark SQL is a Spark library for structured data. It provides Spark with more information about the structure of the data and the computation being performed
PySpark DataFrame is an immutable distributed collection of data with named columns
Designed for processing both structured data (e.g., relational databases) and semi-structured data (e.g., JSON)
The DataFrame API is available in Python, R, Scala, and Java
DataFrames in PySpark support both SQL queries (SELECT * FROM table) and expression methods (df.select()), as sketched below
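A minimal sketch of the two styles, assuming a DataFrame df with a "name" column already exists and spark is an active SparkSession:
# Expression-method style
df.select("name").show()
# SQL style: register the DataFrame as a temporary view first, then query it
df.createOrReplaceTempView("table")
spark.sql("SELECT * FROM table").show()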
SparkContext is the main entry point for creating RDDs
SparkSession provides a single point of entry to interact with Spark DataFrames
SparkSession is used to create DataFrame, register DataFrames, execute SQL queries
SparkSession is available in PySpark shell as spark
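Outside the PySpark shell, a SparkSession can be created explicitly; a minimal sketch (the application name here is just an example):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("BigDataFundamentals") \
    .getOrCreate()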
Two different methods of creating DataFrames in PySpark
From existing RDDs using SparkSession's createDataFrame() method
From various data sources (CSV, JSON, TXT) using SparkSession's read method
Schema defines the structure of the data and helps DataFrames optimize queries
Schema provides information such as column names, the data type of each column, and whether empty (null) values are allowed (see the sketch below)
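An explicit schema can also be passed to createDataFrame(); a sketch using the iPhone columns from the example that follows (the chosen types are illustrative assumptions):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

iphones_schema = StructType([
    StructField("Model", StringType(), True),    # column name, data type, nullable
    StructField("Year", IntegerType(), True),
    StructField("Height", DoubleType(), True),
    StructField("Width", DoubleType(), True),
    StructField("Weight", DoubleType(), True)
])
# The explicit schema could then be passed as:
# iphones_df = spark.createDataFrame(iphones_RDD, schema=iphones_schema)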
# Create an RDD of tuples from in-memory data
iphones_RDD = sc.parallelize([
    ("XS", 2018, 5.65, 2.79, 6.24),
    ("XR", 2018, 5.94, 2.98, 6.84),
    ("X10", 2017, 5.65, 2.79, 6.13),
    ("8Plus", 2017, 6.23, 3.07, 7.12)
])

# Column names used as a simple schema
names = ['Model', 'Year', 'Height', 'Width', 'Weight']

# Convert the RDD into a DataFrame with named columns
iphones_df = spark.createDataFrame(iphones_RDD, schema=names)

type(iphones_df)
pyspark.sql.dataframe.DataFrame
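The resulting DataFrame can then be inspected; a small usage sketch:
iphones_df.printSchema()   # column names and types
iphones_df.show()          # display the rows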
# Read a CSV file, using the first row as a header and inferring column types
df_csv = spark.read.csv("people.csv", header=True, inferSchema=True)
# Read a JSON file
df_json = spark.read.json("people.json")
# Read a plain-text file; each line becomes one row
df_txt = spark.read.text("people.txt")
These read methods take the path to the file and two optional parameters: header=True and inferSchema=True
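A small sketch to check what the optional parameters produced, reusing the df_csv DataFrame from above:
df_csv.printSchema()   # column types inferred from the data (inferSchema=True)
df_csv.show(5)         # first 5 rows, with column names taken from the header row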