Introduction to Spark SQL in Python
Mark Plutowski
Data Scientist
train_id | station | time |
---|---|---|
324 | San Francisco | 7:59 |
324 | 22nd Street | 8:03 |
324 | Millbrae | 8:16 |
324 | Hillsdale | 8:24 |
324 | Redwood City | 8:31 |
324 | Palo Alto | 8:37 |
324 | San Jose | 9:05 |
train_id | station | time | time_to_next_stop |
---|---|---|---|
324 | San Francisco | 7:59 | 4 min |
324 | 22nd Street | 8:03 | 13 min |
324 | Millbrae | 8:16 | 8 min |
324 | Hillsdale | 8:24 | 7 min |
324 | Redwood City | 8:31 | 6 min |
324 | Palo Alto | 8:37 | 28 min |
324 | San Jose | 9:05 | null |
train_id | station | time | time (following row) |
---|---|---|---|
324 | San Francisco | 7:59 | 8:03 |
324 | 22nd Street | 8:03 | 8:16 |
324 | Millbrae | 8:16 | 8:24 |
324 | Hillsdale | 8:24 | 8:31 |
324 | Redwood City | 8:31 | 8:37 |
324 | Palo Alto | 8:37 | 9:05 |
324 | San Jose | 9:05 | null |
query = """ SELECT train_id, station, time, LEAD(time, 1) OVER (ORDER BY time) AS time_next FROM sched WHERE train_id=324 """
spark.sql(query).show()
+--------+-------------+-----+---------+
|train_id| station| time|time_next|
+--------+-------------+-----+---------+
| 324|San Francisco|7:59 | 8:03 |
| 324| 22nd Street|8:03 | 8:16 |
| 324| Millbrae|8:16 | 8:24 |
| 324| Hillsdale|8:24 | 8:31 |
| 324| Redwood City|8:31 | 8:37 |
| 324| Palo Alto|8:37 | 9:05 |
| 324| San Jose|9:05 | null |
+--------+-------------+-----+---------+
SELECT
train_id,
station,
time,
LEAD(time,1) OVER (PARTITION BY train_id ORDER BY time) AS time_next
FROM sched
+--------+-------------+-----+---------+
|train_id| station| time|time_next|
+--------+-------------+-----+---------+
| 217| Gilroy|6:06 | 6:15 |
| 217| San Martin|6:15 | 6:21 |
| 217| Morgan Hill|6:21 | 6:36 |
| 217| Blossom Hill|6:36 | 6:42 |
| 217| Capitol|6:42 | 6:50 |
| 217| Tamien|6:50 | 6:59 |
| 217| San Jose|6:59 | null |
| 324|San Francisco|7:59 | 8:03 |
| 324| 22nd Street|8:03 | 8:16 |
| 324| Millbrae|8:16 | 8:24 |
| 324| Hillsdale|8:24 | 8:31 |
| 324| Redwood City|8:31 | 8:37 |
| 324| Palo Alto|8:37 | 9:05 |
| 324| San Jose|9:05 | null |
+--------+-------------+-----+---------+
train_id | station | time | time_to_next_stop |
---|---|---|---|
324 | San Francisco | 7:59 | 4 min |
324 | 22nd Street | 8:03 | 13 min |
324 | Millbrae | 8:16 | 8 min |
324 | Hillsdale | 8:24 | 7 min |
324 | Redwood City | 8:31 | 6 min |
324 | Palo Alto | 8:37 | 28 min |
324 | San Jose | 9:05 | null |
Introduction to Spark SQL in Python