Intro to Spark SQL

Introduction to PySpark

Benjamin Schmidt

Data Engineer

What is Spark SQL?

  • Module in Apache Spark for structured data processing
  • Allows us to run SQL queries alongside data processing tasks
  • Seamless combination of Python and SQL in one application
  • DataFrame Interfacing: Provides programmatic access to structured data, as sketched below
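
A minimal sketch of that combination, assuming a tiny in-memory DataFrame: the same filter is written once in SQL and once with the DataFrame API (the view name "people" and the column names here are illustrative).

# Same query two ways: SQL against a temp view vs. the DataFrame API
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQL vs DataFrame").getOrCreate()
df = spark.createDataFrame([("Alice", 30), ("Bob", 40)], ["Name", "Age"])
df.createOrReplaceTempView("people")

# SQL version
sql_result = spark.sql("SELECT Name FROM people WHERE Age > 30")

# Equivalent DataFrame version
df_result = df.filter(df.Age > 30).select("Name")

sql_result.show()
df_result.show()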

Creating temp tables

# Import and initialize a Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Spark SQL Example").getOrCreate()

# Sample DataFrame
data = [("Alice", "HR", 30), ("Bob", "IT", 40), ("Cathy", "HR", 28)]
columns = ["Name", "Department", "Age"]
df = spark.createDataFrame(data, schema=columns)

# Register DataFrame as a temporary view
df.createOrReplaceTempView("people")

# Query using SQL
result = spark.sql("SELECT Name, Age FROM people WHERE Age > 30")
result.show()
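
Given the sample data, only Bob's age exceeds 30, so result.show() prints a single row:

+----+---+
|Name|Age|
+----+---+
| Bob| 40|
+----+---+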

Deeper into temp views

  • Temp views let us run SQL analytics without modifying the underlying data
  • Loading from a CSV uses methods we already know
    df = spark.read.csv("path/to/your/file.csv", header=True, inferSchema=True)
    
# Register DataFrame as a temporary view
df.createOrReplaceTempView("employees")
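
Once the view exists, any SQL statement can run against the CSV-backed data. A short sketch of an aggregation, assuming the file contains Department and Salary columns (an assumption about the CSV's schema):

# Average salary per department, computed over the temp view
# (Department and Salary are assumed column names)
avg_salaries = spark.sql("""
    SELECT Department, AVG(Salary) AS Avg_Salary
    FROM employees
    GROUP BY Department
""")
avg_salaries.show()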

Combining SQL and DataFrame operations

# SQL query result
query_result = spark.sql("SELECT Name, Salary FROM employees WHERE Salary > 3000")

# DataFrame transformation
high_earners = query_result.withColumn("Bonus", query_result.Salary * 0.1)
high_earners.show()
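
The transformed DataFrame can itself be registered as a view, so SQL and DataFrame steps can keep alternating. A minimal sketch of that round trip:

# Register the transformed DataFrame and query it with SQL again
high_earners.createOrReplaceTempView("high_earners")
spark.sql("SELECT Name, Bonus FROM high_earners ORDER BY Bonus DESC").show()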

Let's practice!
