Window Function SQL

Introduction to Spark SQL in Python

Mark Plutowski

Data Scientist

What is a Window Function SQL?

  • Express operations more simply than dot notation or queries
  • Each row uses the values of other rows to calculate its value
Introduction to Spark SQL in Python

A train schedule

train_id station time
324 San Francisco 7:59
324 22nd Street 8:03
324 Millbrae 8:16
324 Hillsdale 8:24
324 Redwood City 8:31
324 Palo Alto 8:37
324 San Jose 9:05
Introduction to Spark SQL in Python

Column with time until next stop added

train_id station time time_to_next_stop
324 San Francisco 7:59 4 min
324 22nd Street 8:03 13 min
324 Millbrae 8:16 8 min
324 Hillsdale 8:24 7 min
324 Redwood City 8:31 6 min
324 Palo Alto 8:37 28 min
324 San Jose 9:05 null
Introduction to Spark SQL in Python

Column with time of next stop

train_id station time time (following row)
324 San Francisco 7:59 8:03
324 22nd Street 8:03 8:16
324 Millbrae 8:16 8:24
324 Hillsdale 8:24 8:31
324 Redwood City 8:31 8:37
324 Palo Alto 8:37 9:05
324 San Jose 9:05 null
Introduction to Spark SQL in Python

OVER clause and ORDER BY clause

query = """
SELECT train_id, station, time, 
LEAD(time, 1) OVER (ORDER BY time) AS time_next 
FROM sched 
WHERE train_id=324 """

spark.sql(query).show()
+--------+-------------+-----+---------+
|train_id|      station| time|time_next|
+--------+-------------+-----+---------+
|     324|San Francisco|7:59 |    8:03 |
|     324|  22nd Street|8:03 |    8:16 |
|     324|     Millbrae|8:16 |    8:24 |
|     324|    Hillsdale|8:24 |    8:31 |
|     324| Redwood City|8:31 |    8:37 |
|     324|    Palo Alto|8:37 |    9:05 |
|     324|     San Jose|9:05 |    null |
+--------+-------------+-----+---------+
Introduction to Spark SQL in Python

PARTITION BY clause

SELECT 
train_id, 
station, 
time, 
LEAD(time,1) OVER (PARTITION BY train_id ORDER BY time) AS time_next 
FROM sched 
Introduction to Spark SQL in Python

Result of adding PARTITION BY clause

+--------+-------------+-----+---------+
|train_id|      station| time|time_next|
+--------+-------------+-----+---------+
|     217|       Gilroy|6:06 |    6:15 |
|     217|   San Martin|6:15 |    6:21 |
|     217|  Morgan Hill|6:21 |    6:36 |
|     217| Blossom Hill|6:36 |    6:42 |
|     217|      Capitol|6:42 |    6:50 |
|     217|       Tamien|6:50 |    6:59 |
|     217|     San Jose|6:59 |    null |
|     324|San Francisco|7:59 |    8:03 |
|     324|  22nd Street|8:03 |    8:16 |
|     324|     Millbrae|8:16 |    8:24 |
|     324|    Hillsdale|8:24 |    8:31 |
|     324| Redwood City|8:31 |    8:37 |
|     324|    Palo Alto|8:37 |    9:05 |
|     324|     San Jose|9:05 |    null |
+--------+-------------+-----+---------+
Introduction to Spark SQL in Python
train_id station time time_to_next_stop
324 San Francisco 7:59 4 min
324 22nd Street 8:03 13 min
324 Millbrae 8:16 8 min
324 Hillsdale 8:24 7 min
324 Redwood City 8:31 6 min
324 Palo Alto 8:37 28 min
324 San Jose 9:05 null
Introduction to Spark SQL in Python

Let's practice

Introduction to Spark SQL in Python

Preparing Video For Download...