More on Spark DataFrames

Introduction to PySpark

Benjamin Schmidt

Data Engineer

Creating DataFrames from various data sources

  • CSV Files: Common for structured, delimited data
  • JSON Files: Semi-structured, hierarchical data format
  • Parquet Files: Columnar format optimized for storage and querying, often used in data engineering
  • Examples:
    spark.read.csv("path/to/file.csv")
    spark.read.json("path/to/file.json")
    spark.read.parquet("path/to/file.parquet")

1 https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_csv
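Putting the three readers together, here is a minimal sketch; the file paths are placeholders, and header=True is an optional reader setting that treats the first CSV row as column names.

# Minimal sketch: create a session and read each format (paths are placeholders)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# CSV: header=True treats the first row as column names
csv_df = spark.read.csv("path/to/file.csv", header=True)

# JSON: each line is parsed as one JSON record
json_df = spark.read.json("path/to/file.json")

# Parquet: column names and types come from the file's own metadata
parquet_df = spark.read.parquet("path/to/file.parquet")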

Schema inference and manual schema definition

  • Spark can infer schemas from the data with inferSchema=True

  • Manually defined schemas give tighter control; useful for fixed data structures (both approaches are sketched below)

Schema at Scale
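
A minimal sketch of both approaches, assuming the spark session from the previous example and a placeholder CSV file with id and name columns:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Option 1: infer column types by scanning the data (extra pass, less control)
df_inferred = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)

# Option 2: declare the schema up front (no scan, guaranteed types)
manual_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])
df_manual = spark.read.schema(manual_schema).csv("path/to/file.csv", header=True)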


DataTypes in PySpark DataFrames

  • IntegerType: Whole numbers
    • E.g., 1, 3478, -1890456
  • LongType: Larger whole numbers (8-byte signed integers)
    • E.g., 922334775806
  • FloatType and DoubleType: Floating-point numbers for decimal values
    • E.g., 3.14159
  • StringType: Used for text or string data
    • E.g., "This is an example of a string."
  • ...

DataTypes Syntax for PySpark DataFrames

# Import the necessary types as classes
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField,
                               IntegerType, StringType,
                               ArrayType)

spark = SparkSession.builder.getOrCreate()

# Illustrative rows matching the schema below
data = [(1, "Alice", [85, 90]),
        (2, "Bob", [77, 81])]

# Construct the schema: each StructField is (name, type, nullable)
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("scores", ArrayType(IntegerType()), True)
])

# Set the schema when creating the DataFrame
df = spark.createDataFrame(data, schema=schema)
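
With the schema applied, df.printSchema() confirms the column types; the output looks roughly like this:

df.printSchema()
# root
#  |-- id: integer (nullable = true)
#  |-- name: string (nullable = true)
#  |-- scores: array (nullable = true)
#  |    |-- element: integer (containsNull = true)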

DataFrame operations - selection and filtering

  • Use .select() to choose specific columns
  • Use .filter() or .where() to filter rows based on conditions
  • Use .sort() to order by a collection of columns
# Select and show only the name and age columns
df.select("name", "age").show()
# Filter for rows where age > 30
df.filter(df["age"] > 30).show()
# Use where() to match a specific value
df.where(df["age"] == 30).show()
# Sort by age in descending order
df.sort("age", ascending=False).show()
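
Because each of these methods returns a new DataFrame, they can be chained; a quick sketch using the same name and age columns:

# Chain select, filter, and sort in a single expression
(df.select("name", "age")
   .filter(df["age"] > 30)
   .sort("age", ascending=False)
   .show())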

Sorting and dropping missing values

  • Order data using .sort() or .orderBy()
  • Use .na.drop() to remove rows with null values
# Sort using the age column
df.sort("age", ascending=False).show()

# Drop missing values
df.na.drop().show()
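
.orderBy() is an alias for .sort(), and .na.drop() also accepts a subset argument; a short sketch, assuming the age column from the examples above:

# orderBy() is an alias for sort(); desc() sorts in descending order
from pyspark.sql.functions import desc
df.orderBy(desc("age")).show()

# Drop rows only when the listed columns contain nulls
df.na.drop(subset=["age"]).show()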


Cheatsheet

  • spark.read.json(): Load data from JSON
  • spark.read.schema(): Apply an explicitly defined schema when reading
  • .na.drop(): Drop rows with missing values
  • .select(), .filter(), .sort(), .orderBy(): Basic data manipulation functions

Let's practice!
