Resilient distributed datasets in PySpark

Introduction to PySpark

Benjamin Schmidt

Data Engineer

What is parallelization in PySpark?

  • Automatically parallelizing data and computations across multiple nodes in a cluster
  • Distributed processing of large datasets across multiple nodes
  • Worker nodes process data in parallel, combining at the end of the task
  • Faster processing at scale (think gigabytes or even terabytes); see the sketch below
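A minimal sketch of what parallelization looks like in code (assuming a local SparkSession; the app name and numbers are illustrative):

# Spark splits data into partitions that worker nodes process in parallel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParallelizationExample").getOrCreate()
sc = spark.sparkContext

# Distribute a local collection across the cluster
numbers_rdd = sc.parallelize(range(1_000_000))

# Each partition is processed in parallel; results are combined at the end
print(numbers_rdd.getNumPartitions())          # how many partitions Spark created
print(numbers_rdd.map(lambda x: x * 2).sum())  # mapped in parallel, then combined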

Parallelization

Introduction to PySpark

Understanding RDDs

RDDs or Resilient Distributed Datasets:

  • Distributed data collections across a cluster with automatic recovery from node failures
  • Good for large scale data
  • Immutable: transformed using operations like map() or filter(), with actions like collect() to retrieve results; parallelize() creates an RDD from a local collection (see the sketch below)
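A minimal sketch of these operations, assuming a SparkSession named spark already exists (the data is illustrative):

# parallelize() creates an RDD from a local Python list
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# Transformations like map() and filter() return new, immutable RDDs
squares = rdd.map(lambda x: x * x).filter(lambda x: x > 4)

# Actions like collect() run the computation and return the results
print(squares.collect())  # [9, 16, 25]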
Introduction to PySpark

Creating an RDD

# Initialize a Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDExample").getOrCreate()

# Create a DataFrame from a CSV file
census_df = spark.read.csv("/census.csv")

# Convert the DataFrame to an RDD
census_rdd = census_df.rdd

# Show the RDD's contents using collect()
census_rdd.collect()
Introduction to PySpark

Showing Collect

# Collect the entire DataFrame into a local Python list of Row objects
data_collected = census_df.collect()

# Print the collected data
for row in data_collected:
    print(row)
Introduction to PySpark

RDDs vs DataFrames

DataFrames

  • High-level: optimized for ease of use
  • SQL-like operations: work with SQL-style queries and perform complex operations with less code
  • Schema information: contain columns and types, like a SQL table

RDDs

  • Low-level: more flexible, but require more code for complex operations
  • Type safety: preserve data types, but lack the optimization benefits of DataFrames
  • No schema: harder to work with structured, SQL-style or relational data
  • Scale well to very large datasets
  • Much more verbose than DataFrames and less suited to analytics; see the sketch below
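A small illustration of the difference, assuming the census_df and census_rdd from the earlier slide and a hypothetical "age" column:

# DataFrame: high-level, SQL-like, uses schema information
census_df.filter(census_df["age"] > 30).show()

# RDD: low-level, works on Row objects with no query optimization
census_rdd.filter(lambda row: row["age"] > 30).collect()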
Introduction to PySpark

Some useful functions and methods

  • map(): applies a function (including ones we write, such as a lambda) to every element of a dataset, e.g. rdd.map(map_function)
  • collect(): gathers the data from across the cluster back to the driver, e.g. rdd.collect(); see the sketch below
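A short sketch combining the two, assuming the census_rdd created earlier (taking the first field of each row is just illustrative):

# Apply a lambda to every element of the RDD with map()
first_fields = census_rdd.map(lambda row: row[0])

# Bring the results back to the driver with collect()
print(first_fields.collect())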
Introduction to PySpark

Let's practice!

Introduction to PySpark
