Introduction to PySpark DataFrames

Big Data Fundamentals with PySpark

Upendra Devisetty

Science Analyst, CyVerse

What are PySpark DataFrames?

  • PySpark SQL is a Spark library for structured data. It provides more information about the structure of data and computation

  • PySpark DataFrame is an immutable distributed collection of data with named columns

  • Designed for processing both structured (e.g., relational databases) and semi-structured data (e.g., JSON)

  • The DataFrame API is available in Python, R, Scala, and Java

  • DataFrames in PySpark support both SQL queries (SELECT * FROM table) and expression methods (df.select())


SparkSession - Entry point for DataFrame API

  • SparkContext is the main entry point for creating RDDs

  • SparkSession provides a single point of entry to interact with Spark DataFrames

  • SparkSession is used to create DataFrame, register DataFrames, execute SQL queries

  • SparkSession is available in PySpark shell as spark


Creating DataFrames in PySpark

  • Two different methods of creating DataFrames in PySpark

    • From existing RDDs using SparkSession's createDataFrame() method

    • From various data sources (CSV, JSON, TXT) using SparkSession's read method

  • The schema describes the structure of the data and helps Spark optimize queries on the DataFrame

  • The schema provides information such as column names, the type of data in each column, and whether empty (null) values are allowed


Create a DataFrame from RDD

iphones_RDD = sc.parallelize([
    ("XS", 2018, 5.65, 2.79, 6.24),
    ("XR", 2018, 5.94, 2.98, 6.84),
    ("X10", 2017, 5.65, 2.79, 6.13),
    ("8Plus", 2017, 6.23, 3.07, 7.12)
])
names = ['Model', 'Year', 'Height', 'Width', 'Weight']
iphones_df = spark.createDataFrame(iphones_RDD, schema=names)

type(iphones_df)
pyspark.sql.dataframe.DataFrame

Create a DataFrame from reading a CSV/JSON/TXT

df_csv = spark.read.csv("people.csv", header=True, inferSchema=True)
df_json = spark.read.json("people.json")
df_txt = spark.read.text("people.txt")
  • Path to the file is the required argument

  • Two optional parameters (for CSV): header=True, inferSchema=True

Let's practice

