Introductie tot Spark SQL in Python
Mark Plutowski
Data Scientist
| trein_id | station | tijd |
|---|---|---|
| 324 | San Francisco | 7:59 |
| 324 | 22nd Street | 8:03 |
| 324 | Millbrae | 8:16 |
| 324 | Hillsdale | 8:24 |
| 324 | Redwood City | 8:31 |
| 324 | Palo Alto | 8:37 |
| 324 | San Jose | 9:05 |
| trein_id | station | tijd | tijd_tot_volgende_stop |
|---|---|---|---|
| 324 | San Francisco | 7:59 | 4 min |
| 324 | 22nd Street | 8:03 | 13 min |
| 324 | Millbrae | 8:16 | 8 min |
| 324 | Hillsdale | 8:24 | 7 min |
| 324 | Redwood City | 8:31 | 6 min |
| 324 | Palo Alto | 8:37 | 28 min |
| 324 | San Jose | 9:05 | null |
| trein_id | station | tijd | tijd (volgende rij) |
|---|---|---|---|
| 324 | San Francisco | 7:59 | 8:03 |
| 324 | 22nd Street | 8:03 | 8:16 |
| 324 | Millbrae | 8:16 | 8:24 |
| 324 | Hillsdale | 8:24 | 8:31 |
| 324 | Redwood City | 8:31 | 8:37 |
| 324 | Palo Alto | 8:37 | 9:05 |
| 324 | San Jose | 9:05 | null |
query = """ SELECT train_id, station, time, LEAD(time, 1) OVER (ORDER BY time) AS time_next FROM sched WHERE train_id=324 """spark.sql(query).show()
+--------+-------------+-----+---------+
|train_id| station| time|time_next|
+--------+-------------+-----+---------+
| 324|San Francisco|7:59 | 8:03 |
| 324| 22nd Street|8:03 | 8:16 |
| 324| Millbrae|8:16 | 8:24 |
| 324| Hillsdale|8:24 | 8:31 |
| 324| Redwood City|8:31 | 8:37 |
| 324| Palo Alto|8:37 | 9:05 |
| 324| San Jose|9:05 | null |
+--------+-------------+-----+---------+
SELECT
trein_id,
station,
tijd,
LEAD(tijd,1) OVER (PARTITION BY trein_id ORDER BY tijd) AS tijd_volgende
FROM sched
+--------+-------------+-----+-----------+
|train_id| station| time|time_next |
+--------+-------------+-----+-----------+
| 217| Gilroy|6:06 | 6:15 |
| 217| San Martin|6:15 | 6:21 |
| 217| Morgan Hill|6:21 | 6:36 |
| 217| Blossom Hill|6:36 | 6:42 |
| 217| Capitol|6:42 | 6:50 |
| 217| Tamien|6:50 | 6:59 |
| 217| San Jose|6:59 | null |
| 324|San Francisco|7:59 | 8:03 |
| 324| 22nd Street|8:03 | 8:16 |
| 324| Millbrae|8:16 | 8:24 |
| 324| Hillsdale|8:24 | 8:31 |
| 324| Redwood City|8:31 | 8:37 |
| 324| Palo Alto|8:37 | 9:05 |
| 324| San Jose|9:05 | null |
+--------+-------------+-----+-----------+
| trein_id | station | tijd | tijd_tot_volgende_stop |
|---|---|---|---|
| 324 | San Francisco | 7:59 | 4 min |
| 324 | 22nd Street | 8:03 | 13 min |
| 324 | Millbrae | 8:16 | 8 min |
| 324 | Hillsdale | 8:24 | 7 min |
| 324 | Redwood City | 8:31 | 6 min |
| 324 | Palo Alto | 8:37 | 28 min |
| 324 | San Jose | 9:05 | null |
Introductie tot Spark SQL in Python