Pencere Fonksiyonu SQL

Python ile Spark SQL'e Giriş

Mark Plutowski

Data Scientist

Pencere Fonksiyonu SQL nedir?

  • Nokta gösterimi veya sorgulardan daha basit ifade
  • Her satır, değerini hesaplamak için diğer satırların değerlerini kullanır
Python ile Spark SQL'e Giriş

Bir tren çizelgesi

train_id station time
324 San Francisco 7:59
324 22nd Street 8:03
324 Millbrae 8:16
324 Hillsdale 8:24
324 Redwood City 8:31
324 Palo Alto 8:37
324 San Jose 9:05
Python ile Spark SQL'e Giriş

Sonraki durağa kadar süre sütunu eklendi

train_id station time time_to_next_stop
324 San Francisco 7:59 4 dk
324 22nd Street 8:03 13 dk
324 Millbrae 8:16 8 dk
324 Hillsdale 8:24 7 dk
324 Redwood City 8:31 6 dk
324 Palo Alto 8:37 28 dk
324 San Jose 9:05 null
Python ile Spark SQL'e Giriş

Sonraki durağın zamanı sütunu

train_id station time time (sonraki satır)
324 San Francisco 7:59 8:03
324 22nd Street 8:03 8:16
324 Millbrae 8:16 8:24
324 Hillsdale 8:24 8:31
324 Redwood City 8:31 8:37
324 Palo Alto 8:37 9:05
324 San Jose 9:05 null
Python ile Spark SQL'e Giriş

OVER ve ORDER BY ifadeleri

query = """
SELECT train_id, station, time, 
LEAD(time, 1) OVER (ORDER BY time) AS time_next 
FROM sched 
WHERE train_id=324 """

spark.sql(query).show()
+--------+-------------+-----+---------+
|train_id|      station| time|time_next|
+--------+-------------+-----+---------+
|     324|San Francisco|7:59 |    8:03 |
|     324|  22nd Street|8:03 |    8:16 |
|     324|     Millbrae|8:16 |    8:24 |
|     324|    Hillsdale|8:24 |    8:31 |
|     324| Redwood City|8:31 |    8:37 |
|     324|    Palo Alto|8:37 |    9:05 |
|     324|     San Jose|9:05 |    null |
+--------+-------------+-----+---------+
Python ile Spark SQL'e Giriş

PARTITION BY ifadesi

SELECT 
train_id, 
station, 
time, 
LEAD(time,1) OVER (PARTITION BY train_id ORDER BY time) AS time_next 
FROM sched 
Python ile Spark SQL'e Giriş

PARTITION BY eklenince sonuç

+--------+-------------+-----+---------+
|train_id|      station| time|time_next|
+--------+-------------+-----+---------+
|     217|       Gilroy|6:06 |    6:15 |
|     217|   San Martin|6:15 |    6:21 |
|     217|  Morgan Hill|6:21 |    6:36 |
|     217| Blossom Hill|6:36 |    6:42 |
|     217|      Capitol|6:42 |    6:50 |
|     217|       Tamien|6:50 |    6:59 |
|     217|     San Jose|6:59 |    null |
|     324|San Francisco|7:59 |    8:03 |
|     324|  22nd Street|8:03 |    8:16 |
|     324|     Millbrae|8:16 |    8:24 |
|     324|    Hillsdale|8:24 |    8:31 |
|     324| Redwood City|8:31 |    8:37 |
|     324|    Palo Alto|8:37 |    9:05 |
|     324|     San Jose|9:05 |    null |
+--------+-------------+-----+---------+
Python ile Spark SQL'e Giriş
train_id station time time_to_next_stop
324 San Francisco 7:59 4 dk
324 22nd Street 8:03 13 dk
324 Millbrae 8:16 8 dk
324 Hillsdale 8:24 7 dk
324 Redwood City 8:31 6 dk
324 Palo Alto 8:37 28 dk
324 San Jose 9:05 null
Python ile Spark SQL'e Giriş

Hadi pratik yapalım

Python ile Spark SQL'e Giriş

Preparing Video For Download...