Resilient distributed datasets in PySpark

Introduction to PySpark

Benjamin Schmidt

Data Engineer

What is parallelization in PySpark?

  • Automatically parallelizing data and computations across multiple nodes in a cluster
  • Distributed processing of large datasets across multiple nodes
  • Worker nodes process data in parallel, combining at the end of the task
  • Faster processing at scale (think gigabytes or even terabytes); see the sketch below
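A minimal sketch of what parallelization looks like in code (assuming a local SparkSession; the app name and numbers are illustrative):

# Spark splits data into partitions that worker nodes process in parallel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParallelizationExample").getOrCreate()
sc = spark.sparkContext

# Distribute a local collection across the cluster
numbers_rdd = sc.parallelize(range(1_000_000))

# Each partition is processed in parallel; results are combined at the end
print(numbers_rdd.getNumPartitions())          # how many partitions Spark created
print(numbers_rdd.map(lambda x: x * 2).sum())  # mapped in parallel, then combined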

Parallelization

Introduction to PySpark

Understanding RDDs

RDDs or Resilient Distributed Datasets:

  • Distributed data collections across a cluster with automatic recovery from node failures
  • Good for large scale data
  • Immutable: transformed using operations like map() or filter(), with actions like collect() to retrieve results; parallelize() creates an RDD from a local collection (see the sketch below)
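A minimal sketch of these operations, assuming a SparkSession named spark already exists (the data is illustrative):

# parallelize() creates an RDD from a local Python list
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# Transformations like map() and filter() return new, immutable RDDs
squares = rdd.map(lambda x: x * x).filter(lambda x: x > 4)

# Actions like collect() run the computation and return the results
print(squares.collect())  # [9, 16, 25]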
Introduction to PySpark

Creating an RDD

# Initialize a Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDExample").getOrCreate()

# Create a DataFrame from a CSV file
census_df = spark.read.csv("/census.csv")

# Convert the DataFrame to an RDD
census_rdd = census_df.rdd

# Show the RDD's contents using collect()
census_rdd.collect()
Introduction to PySpark

Showing Collect

# Collect the entire DataFrame into a local Python list of Row objects
data_collected = census_df.collect()

# Print the collected data
for row in data_collected:
    print(row)
Introduction to PySpark

RDDs vs DataFrames

DataFrames

  • High-level: optimized for ease of use
  • SQL-like operations: work with SQL-style queries and perform complex operations with less code
  • Schema information: contain columns and types, like a SQL table

RDDs

  • Low-level: more flexible, but require more code for complex operations
  • Type safety: preserve data types, but lack the optimization benefits of DataFrames
  • No schema: harder to work with structured, SQL-style or relational data
  • Scale well to very large datasets
  • Much more verbose than DataFrames and less suited to analytics; see the sketch below
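A small illustration of the difference, assuming the census_df and census_rdd from the earlier slide and a hypothetical "age" column:

# DataFrame: high-level, SQL-like, uses schema information
census_df.filter(census_df["age"] > 30).show()

# RDD: low-level, works on Row objects with no query optimization
census_rdd.filter(lambda row: row["age"] > 30).collect()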
Introduction to PySpark

Some useful functions and methods

  • map(): applies a function (including ones we write, such as a lambda) to every element of a dataset, e.g. rdd.map(map_function)
  • collect(): gathers the data from across the cluster back to the driver, e.g. rdd.collect(); see the sketch below
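A short sketch combining the two, assuming the census_rdd created earlier (taking the first field of each row is just illustrative):

# Apply a lambda to every element of the RDD with map()
first_fields = census_rdd.map(lambda row: row[0])

# Bring the results back to the driver with collect()
print(first_fields.collect())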
Introduction to PySpark

Let's practice!

Introduction to PySpark
