Spark SQL

Introduction to Spark SQL in Python

Mark Plutowski, PhD

Data Scientist

Load a dataframe from file

df = spark.read.csv(filename)

df = spark.read.csv(filename, header=True)

Create SQL table and query it

df.createOrReplaceTempView("schedule")

spark.sql("SELECT * FROM schedule WHERE station = 'San Jose'").show()

+--------+--------+-----+
|train_id| station| time|
+--------+--------+-----+
|     324|San Jose|9:05a|
|     217|San Jose|6:59a|
+--------+--------+-----+

Inspecting table schema

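# Three alternative queries for inspecting the columns of a table:
result = spark.sql("SHOW COLUMNS FROM tablename")

result = spark.sql("SELECT * FROM tablename LIMIT 0")

result = spark.sql("DESCRIBE tablename")

# Display the query result
result.show()

# Lists the table's own column names only for the SELECT ... LIMIT 0 form
print(result.columns)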

Data Frame

Tabular data

+--------+-------------+-----+
|train_id|      station| time|
+--------+-------------+-----+
|     324|San Francisco|7:59a| 
|     324|  22nd Street|8:03a|
|     324|     Millbrae|8:16a|
|     324|    Hillsdale|8:24a|
|     324| Redwood City|8:31a|
|     324|    Palo Alto|8:37a|
|     324|     San Jose|9:05a|
|     217|       Gilroy|6:06a|
|     217|   San Martin|6:15a|
|     217|  Morgan Hill|6:21a|
|     217| Blossom Hill|6:36a|
|     217|      Capitol|6:42a|
|     217|       Tamien|6:50a|
|     217|     San Jose|6:59a|
+--------+-------------+-----+

[Figures: two dataframes; the two dataframes concatenated into one dataframe; a dataframe split into parts and distributed]

SQL

Structured Query Language

[Figures: querying distributed data]

Loading delimited text

Load the comma-delimited file trainsched.txt into a dataframe called df:

df = spark.read.csv("trainsched.txt", header=True)

df.show()

+--------+-------------+-----+
|train_id|      station| time|
+--------+-------------+-----+
|     324|San Francisco|7:59a|
|     324|  22nd Street|8:03a|
|     324|     Millbrae|8:16a|
|     324|    Hillsdale|8:24a|
|     324| Redwood City|8:31a|
|     ...|          ...|  ...|
|     217| Blossom Hill|6:36a|
|     217|      Capitol|6:42a|
|     217|       Tamien|6:50a|
|     217|     San Jose|6:59a|
+--------+-------------+-----+

PySpark Shell

Let's practice
