More on Spark DataFrames

Introduction to PySpark

Benjamin Schmidt

Data Engineer

Creating DataFrames from various data sources

  • CSV Files: Common for structured, delimited data
  • JSON Files: Semi-structured, hierarchical data format
  • Parquet Files: Columnar format optimized for storage and querying, often used in data engineering
  • Examples:
    spark.read.csv("path/to/file.csv")
    spark.read.json("path/to/file.json")
    spark.read.parquet("path/to/file.parquet")

1 https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_csv
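Putting the three readers together, here is a minimal sketch; the file paths are placeholders, and header=True is an optional reader setting that treats the first CSV row as column names.

# Minimal sketch: create a session and read each format (paths are placeholders)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# CSV: header=True treats the first row as column names
csv_df = spark.read.csv("path/to/file.csv", header=True)

# JSON: each line is parsed as one JSON record
json_df = spark.read.json("path/to/file.json")

# Parquet: column names and types come from the file's own metadata
parquet_df = spark.read.parquet("path/to/file.parquet")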

Schema inference and manual schema definition

  • Spark can infer schemas from the data with inferSchema=True

  • Manually defined schemas give tighter control; useful for fixed data structures (both approaches are sketched below)

Schema at Scale
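
A minimal sketch of both approaches, assuming the spark session from the previous example and a placeholder CSV file with id and name columns:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Option 1: infer column types by scanning the data (extra pass, less control)
df_inferred = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)

# Option 2: declare the schema up front (no scan, guaranteed types)
manual_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])
df_manual = spark.read.schema(manual_schema).csv("path/to/file.csv", header=True)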


DataTypes in PySpark DataFrames

  • IntegerType: Whole numbers
    • E.g., 1, 3478, -1890456
  • LongType: Larger whole numbers (8-byte signed integers)
    • E.g., 922334775806
  • FloatType and DoubleType: Floating-point numbers for decimal values
    • E.g., 3.14159
  • StringType: Used for text or string data
    • E.g., "This is an example of a string."
  • ...

DataTypes Syntax for PySpark DataFrames

# Import the necessary types as classes
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField,
                               IntegerType, StringType,
                               ArrayType)

spark = SparkSession.builder.getOrCreate()

# Illustrative rows matching the schema below
data = [(1, "Alice", [85, 90]),
        (2, "Bob", [77, 81])]

# Construct the schema: each StructField is (name, type, nullable)
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("scores", ArrayType(IntegerType()), True)
])

# Set the schema when creating the DataFrame
df = spark.createDataFrame(data, schema=schema)
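
With the schema applied, df.printSchema() confirms the column types; the output looks roughly like this:

df.printSchema()
# root
#  |-- id: integer (nullable = true)
#  |-- name: string (nullable = true)
#  |-- scores: array (nullable = true)
#  |    |-- element: integer (containsNull = true)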

DataFrame operations - selection and filtering

  • Use .select() to choose specific columns
  • Use .filter() or .where() to filter rows based on conditions
  • Use .sort() to order by a collection of columns
# Select and show only the name and age columns
df.select("name", "age").show()
# Filter for rows where age > 30
df.filter(df["age"] > 30).show()
# Use where() to match a specific value
df.where(df["age"] == 30).show()
# Sort by age in descending order
df.sort("age", ascending=False).show()
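
Because each of these methods returns a new DataFrame, they can be chained; a quick sketch using the same name and age columns:

# Chain select, filter, and sort in a single expression
(df.select("name", "age")
   .filter(df["age"] > 30)
   .sort("age", ascending=False)
   .show())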

Sorting and dropping missing values

  • Order data using .sort() or .orderBy()
  • Use .na.drop() to remove rows with null values
# Sort using the age column
df.sort("age", ascending=False).show()

# Drop missing values
df.na.drop().show()
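
.orderBy() is an alias for .sort(), and .na.drop() also accepts a subset argument; a short sketch, assuming the age column from the examples above:

# orderBy() is an alias for sort(); desc() sorts in descending order
from pyspark.sql.functions import desc
df.orderBy(desc("age")).show()

# Drop rows only when the listed columns contain nulls
df.na.drop(subset=["age"]).show()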


Cheatsheet

  • spark.read.json(): Load data from JSON
  • spark.read.schema(): Apply an explicitly defined schema when reading
  • .na.drop(): Drop rows with missing values
  • .select(), .filter(), .sort(), .orderBy(): Basic data manipulation functions

Let's practice!
