Spark SQL

Introduction to Spark SQL in Python

Mark Plutowski, PhD

Data Scientist

Load a dataframe from file

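df = spark.read.csv(filename)               # no header row: columns are named _c0, _c1, ...
df = spark.read.csv(filename, header=True)  # first line of the file supplies the column names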

Create SQL table and query it

df.createOrReplaceTempView("schedule")

spark.sql("SELECT * FROM schedule WHERE station = 'San Jose'").show()
+--------+--------+-----+
|train_id| station| time|
+--------+--------+-----+
|     324|San Jose|9:05a|
|     217|San Jose|6:59a|
+--------+--------+-----+

Inspecting table schema

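# Three alternative ways to inspect the columns of a table; use any one of them.
result = spark.sql("SHOW COLUMNS FROM tablename")        # one row per column name
result = spark.sql("SELECT * FROM tablename LIMIT 0")    # no rows, but the table's columns
result = spark.sql("DESCRIBE tablename")                 # one row per column: name, type, comment
result.show()
print(result.columns)  # column names of the query result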

Data Frame

Tabular data

+--------+-------------+-----+
|train_id|      station| time|
+--------+-------------+-----+
|     324|San Francisco|7:59a|
|     324|  22nd Street|8:03a|
|     324|     Millbrae|8:16a|
|     324|    Hillsdale|8:24a|
|     324| Redwood City|8:31a|
|     324|    Palo Alto|8:37a|
|     324|     San Jose|9:05a|
|     217|       Gilroy|6:06a|
|     217|   San Martin|6:15a|
|     217|  Morgan Hill|6:21a|
|     217| Blossom Hill|6:36a|
|     217|      Capitol|6:42a|
|     217|       Tamien|6:50a|
|     217|     San Jose|6:59a|
+--------+-------------+-----+

Figures: two dataframes; two dataframes concatenated; one dataframe; splitting a dataframe (steps 1 to 6); splitting a dataframe, distributed

SQL: Structured Query Language

Figures: querying; distributed data; distributed data + data

Loading delimited text

Load the comma-delimited file trainsched.txt into a dataframe called df:

df = spark.read.csv("trainsched.txt", header=True)
df.show()
+--------+-------------+-----+
|train_id|      station| time|
+--------+-------------+-----+
|     324|San Francisco|7:59a|
|     324|  22nd Street|8:03a|
|     324|     Millbrae|8:16a|
|     324|    Hillsdale|8:24a|
|     324| Redwood City|8:31a|
|     ...|          ...|  ...|
|     217| Blossom Hill|6:36a|
|     217|      Capitol|6:42a|
|     217|       Tamien|6:50a|
|     217|     San Jose|6:59a|
+--------+-------------+-----+

Figure: PySpark shell

Let's practice
