Dot notation and SQL

Introduction to Spark SQL in Python

Mark Plutowski

Data Scientist

Our table has 3 columns

df.columns
['train_id', 'station', 'time']
df.show(5)
+--------+-------------+-----+
|train_id|      station| time|
+--------+-------------+-----+
|     324|San Francisco|7:59 |
|     324|  22nd Street|8:03 |
|     324|     Millbrae|8:16 |
|     324|    Hillsdale|8:24 |
|     324| Redwood City|8:31 |
+--------+-------------+-----+

Only 2 of them are needed

df.select('train_id','station')
  .show(5)
+--------+-------------+
|train_id|      station|
+--------+-------------+
|     324|San Francisco|
|     324|  22nd Street|
|     324|     Millbrae|
|     324|    Hillsdale|
|     324| Redwood City|
+--------+-------------+

Three ways to select 2 columns

  • df.select('train_id', 'station')
  • df.select(df.train_id, df.station)
  • df.select(col('train_id'), col('station'))
    (requires from pyspark.sql.functions import col)

Two ways to rename a column

df.select('train_id','station')
  .withColumnRenamed('train_id','train')
  .show(5)
+-----+-------------+
|train|      station|
+-----+-------------+
|  324|San Francisco|
|  324|  22nd Street|
|  324|     Millbrae|
|  324|    Hillsdale|
|  324| Redwood City|
+-----+-------------+
df.select(col('train_id').alias('train'), 'station')

Don't do this!

df.select('train_id', df.station, col('time'))

Mixing all three column-reference styles in one expression runs fine, but it is inconsistent and harder to read; pick one style and use it throughout.


SQL queries using dot notation

spark.sql('SELECT train_id AS train, station FROM schedule LIMIT 5')
     .show()
+-----+-------------+
|train|      station|
+-----+-------------+
|  324|San Francisco|
|  324|  22nd Street|
|  324|     Millbrae|
|  324|    Hillsdale|
|  324| Redwood City|
+-----+-------------+
df.select(col('train_id').alias('train'), 'station')
  .limit(5)
  .show()

Window function SQL

query = """
SELECT *, 
ROW_NUMBER() OVER(PARTITION BY train_id ORDER BY time) AS id 
FROM schedule
"""
spark.sql(query)
     .show(11)

Window function SQL

+--------+-------------+-----+---+
|train_id|      station| time| id|
+--------+-------------+-----+---+
|     217|       Gilroy|6:06 |  1|
|     217|   San Martin|6:15 |  2|
|     217|  Morgan Hill|6:21 |  3|
|     217| Blossom Hill|6:36 |  4|
|     217|      Capitol|6:42 |  5|
|     217|       Tamien|6:50 |  6|
|     217|     San Jose|6:59 |  7|
|     324|San Francisco|7:59 |  1|
|     324|  22nd Street|8:03 |  2|
|     324|     Millbrae|8:16 |  3|
|     324|    Hillsdale|8:24 |  4|
+--------+-------------+-----+---+

Window function using dot notation

from pyspark.sql import Window
from pyspark.sql.functions import row_number
df.withColumn("id", row_number()
                    .over(
                           Window.partitionBy('train_id')
                                 .orderBy('time')
                         )
  )
  • SQL's ROW_NUMBER : pyspark.sql.functions.row_number
  • the inside of the OVER clause : pyspark.sql.Window
  • PARTITION BY : pyspark.sql.Window.partitionBy
  • ORDER BY : pyspark.sql.Window.orderBy

Using a WindowSpec

  • The over function in Spark SQL corresponds to the OVER clause in SQL.
  • The pyspark.sql.window.Window class represents the inside of the OVER clause.
from pyspark.sql.functions import lead
window = Window.partitionBy('train_id').orderBy('time')
dfx = df.withColumn('next', lead('time', 1).over(window))
  • Above, type(window) is pyspark.sql.window.WindowSpec

Let's practice!
