Intro to Spark SQL

Introduction to PySpark

Benjamin Schmidt

Data Engineer

What is Spark SQL?

  • Module in Apache Spark for structured data processing
  • Allows us to run SQL queries alongside data processing tasks
  • Seamless combination of Python and SQL in one application
  • DataFrame Interfacing: Provides programmatic access to structured data, as sketched below
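
A minimal sketch of that combination, assuming a tiny in-memory DataFrame: the same filter is written once in SQL and once with the DataFrame API (the view name "people" and the column names here are illustrative).

# Same query two ways: SQL against a temp view vs. the DataFrame API
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQL vs DataFrame").getOrCreate()
df = spark.createDataFrame([("Alice", 30), ("Bob", 40)], ["Name", "Age"])
df.createOrReplaceTempView("people")

# SQL version
sql_result = spark.sql("SELECT Name FROM people WHERE Age > 30")

# Equivalent DataFrame version
df_result = df.filter(df.Age > 30).select("Name")

sql_result.show()
df_result.show()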

Creating temp tables

# Import and initialize a Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Spark SQL Example").getOrCreate()

# Sample DataFrame
data = [("Alice", "HR", 30), ("Bob", "IT", 40), ("Cathy", "HR", 28)]
columns = ["Name", "Department", "Age"]
df = spark.createDataFrame(data, schema=columns)

# Register DataFrame as a temporary view
df.createOrReplaceTempView("people")

# Query using SQL
result = spark.sql("SELECT Name, Age FROM people WHERE Age > 30")
result.show()
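
Given the sample data, only Bob's age exceeds 30, so result.show() prints a single row:

+----+---+
|Name|Age|
+----+---+
| Bob| 40|
+----+---+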

Deeper into temp views

  • Temp views let us run SQL analytics without modifying the underlying data
  • Loading from a CSV uses methods we already know
    df = spark.read.csv("path/to/your/file.csv", header=True, inferSchema=True)
    
# Register DataFrame as a temporary view
df.createOrReplaceTempView("employees")
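
Once the view exists, any SQL statement can run against the CSV-backed data. A short sketch of an aggregation, assuming the file contains Department and Salary columns (an assumption about the CSV's schema):

# Average salary per department, computed over the temp view
# (Department and Salary are assumed column names)
avg_salaries = spark.sql("""
    SELECT Department, AVG(Salary) AS Avg_Salary
    FROM employees
    GROUP BY Department
""")
avg_salaries.show()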

Combining SQL and DataFrame operations

# SQL query result
query_result = spark.sql("SELECT Name, Salary FROM employees WHERE Salary > 3000")

# DataFrame transformation
high_earners = query_result.withColumn("Bonus", query_result.Salary * 0.1)
high_earners.show()
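
The transformed DataFrame can itself be registered as a view, so SQL and DataFrame steps can keep alternating. A minimal sketch of that round trip:

# Register the transformed DataFrame and query it with SQL again
high_earners.createOrReplaceTempView("high_earners")
spark.sql("SELECT Name, Bonus FROM high_earners ORDER BY Bonus DESC").show()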

Let's practice!
