Fungsi Window SQL

Pengantar Spark SQL dalam Python

Mark Plutowski

Data Scientist

Apa itu Fungsi Window SQL?

  • Menyederhanakan operasi dibanding notasi titik atau kueri
  • Tiap baris memakai nilai baris lain untuk menghitung nilainya
Pengantar Spark SQL dalam Python

Jadwal kereta

train_id station time
324 San Francisco 7:59
324 22nd Street 8:03
324 Millbrae 8:16
324 Hillsdale 8:24
324 Redwood City 8:31
324 Palo Alto 8:37
324 San Jose 9:05
Pengantar Spark SQL dalam Python

Kolom waktu hingga pemberhentian berikutnya ditambahkan

train_id station time time_to_next_stop
324 San Francisco 7:59 4 mnt
324 22nd Street 8:03 13 mnt
324 Millbrae 8:16 8 mnt
324 Hillsdale 8:24 7 mnt
324 Redwood City 8:31 6 mnt
324 Palo Alto 8:37 28 mnt
324 San Jose 9:05 null
Pengantar Spark SQL dalam Python

Kolom waktu pemberhentian selanjutnya

train_id station time time (baris berikut)
324 San Francisco 7:59 8:03
324 22nd Street 8:03 8:16
324 Millbrae 8:16 8:24
324 Hillsdale 8:24 8:31
324 Redwood City 8:31 8:37
324 Palo Alto 8:37 9:05
324 San Jose 9:05 null
Pengantar Spark SQL dalam Python

Klausa OVER dan ORDER BY

query = """
SELECT train_id, station, time, 
LEAD(time, 1) OVER (ORDER BY time) AS time_next 
FROM sched 
WHERE train_id=324 """

spark.sql(query).show()
+--------+-------------+-----+---------+
|train_id|      station| time|time_next|
+--------+-------------+-----+---------+
|     324|San Francisco|7:59 |    8:03 |
|     324|  22nd Street|8:03 |    8:16 |
|     324|     Millbrae|8:16 |    8:24 |
|     324|    Hillsdale|8:24 |    8:31 |
|     324| Redwood City|8:31 |    8:37 |
|     324|    Palo Alto|8:37 |    9:05 |
|     324|     San Jose|9:05 |    null |
+--------+-------------+-----+---------+
Pengantar Spark SQL dalam Python

Klausa PARTITION BY

SELECT 
train_id, 
station, 
time, 
LEAD(time,1) OVER (PARTITION BY train_id ORDER BY time) AS time_next 
FROM sched 
Pengantar Spark SQL dalam Python

Hasil dengan klausa PARTITION BY

+--------+-------------+-----+---------+
|train_id|      station| time|time_next|
+--------+-------------+-----+---------+
|     217|       Gilroy|6:06 |    6:15 |
|     217|   San Martin|6:15 |    6:21 |
|     217|  Morgan Hill|6:21 |    6:36 |
|     217| Blossom Hill|6:36 |    6:42 |
|     217|      Capitol|6:42 |    6:50 |
|     217|       Tamien|6:50 |    6:59 |
|     217|     San Jose|6:59 |    null |
|     324|San Francisco|7:59 |    8:03 |
|     324|  22nd Street|8:03 |    8:16 |
|     324|     Millbrae|8:16 |    8:24 |
|     324|    Hillsdale|8:24 |    8:31 |
|     324| Redwood City|8:31 |    8:37 |
|     324|    Palo Alto|8:37 |    9:05 |
|     324|     San Jose|9:05 |    null |
+--------+-------------+-----+---------+
Pengantar Spark SQL dalam Python
train_id station time time_to_next_stop
324 San Francisco 7:59 4 mnt
324 22nd Street 8:03 13 mnt
324 Millbrae 8:16 8 mnt
324 Hillsdale 8:24 7 mnt
324 Redwood City 8:31 6 mnt
324 Palo Alto 8:37 28 mnt
324 San Jose 9:05 null
Pengantar Spark SQL dalam Python

Ayo berlatih!

Pengantar Spark SQL dalam Python

Preparing Video For Download...