Extract Transform Select

Introduction to Spark SQL in Python

Mark Plutowski

Data Scientist

ETS

Extract Transform Select

Extract, Transform, and Select

Extraction
Transformation
Selection

Built-in functions

from pyspark.sql.functions import split, explode
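To make the row-level effect of these two built-ins concrete, here is a minimal plain-Python sketch that runs without Spark. It only approximates the behavior: split turns a string into an array of tokens, and explode emits one row per array element. The helper names and the sample sentences are hypothetical, and Spark's split uses Java regular expressions rather than Python's re module.

```python
# Plain-Python approximation of split and explode (no Spark required).
# split_rows, explode_rows, and the sample data are hypothetical names
# introduced only for illustration.
import re


def split_rows(rows, pattern=r"\s+"):
    """Mimic split: turn each string into a list of tokens."""
    return [re.split(pattern, row) for row in rows]


def explode_rows(rows_of_lists):
    """Mimic explode: emit one output row per list element."""
    return [item for row in rows_of_lists for item in row]


sentences = ["how many are there", "quite so"]
word_arrays = split_rows(sentences)   # one list per input row
words = explode_rows(word_arrays)     # one word per output row

print(word_arrays)
print(words)
```

In Spark the same two steps would be applied to DataFrame columns, so the result is a column of arrays followed by a column with one word per row.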

The length function

from pyspark.sql.functions import length

df.where(length('sentence') == 0)

Creating a custom function

User-defined function
UDF

Importing the udf function

from pyspark.sql.functions import udf

Creating a boolean UDF

print(df)

DataFrame[textdata: string]

from pyspark.sql.functions import udf

from pyspark.sql.types import BooleanType

Creating a boolean UDF

# True when the text is null, empty, or shorter than 10 characters
short_udf = udf(lambda x: not x or len(x) < 10, BooleanType())

df.select(short_udf('textdata')\
  .alias("is short"))\
  .show(3)

+--------+
|is short|
+--------+
|   false|
|    true|
|   false|
+--------+
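Because a UDF wraps an ordinary Python callable, its logic can be checked on plain values before it is handed to udf. The sketch below restates the short-text rule as a named function; the name is_short, the limit parameter, and the sample strings are assumptions made for illustration.

```python
# The row-level rule behind the boolean UDF, written as a plain function
# so its edge cases can be checked without a Spark session.
# is_short, limit, and the samples are hypothetical.
def is_short(text, limit=10):
    """Return True for None, empty strings, or strings shorter than limit."""
    return not text or len(text) < limit


samples = [None, "", "quite so", "you have not observed"]
flags = [is_short(s) for s in samples]
print(flags)
```

The None case matters because a Spark column can hold nulls, and those reach the Python function as None.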

Important UDF return types

from pyspark.sql.types import StringType, IntegerType, FloatType, ArrayType

Creating an array UDF

df3.select('word array', in_udf('word array').alias('without endword'))\
   .show(5, truncate=30)

+-----------------------------+----------------------+
|                   word array|       without endword|
+-----------------------------+----------------------+
|[then, how, many, are, there]|[then, how, many, are]|
|                  [how, many]|                 [how]|
|             [i, donot, know]|            [i, donot]|
|                  [quite, so]|               [quite]|
|   [you, have, not, observed]|      [you, have, not]|
+-----------------------------+----------------------+

Creating an array UDF

from pyspark.sql.types import StringType, ArrayType

# Removes last item in array
in_udf = udf(lambda x: 
    x[0:len(x)-1] if x and len(x) > 1 
    else [], 
    ArrayType(StringType()))
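The same drop-the-last-item rule can be exercised on plain Python lists, which makes the guard conditions easier to see. This is a sketch under assumptions: the function name without_endword is hypothetical (borrowed from the column alias in the output table), and it uses the slice x[:-1] in place of x[0:len(x)-1], which selects the same elements.

```python
# Plain-Python version of the array rule: drop the final element.
# without_endword and the sample rows are hypothetical.
def without_endword(words):
    """Return all but the last item; [] for None, empty, or one-item input."""
    return words[:-1] if words and len(words) > 1 else []


rows = [
    ["then", "how", "many", "are", "there"],
    ["how", "many"],
    ["quite"],
    None,
]
for row in rows:
    print(row, "->", without_endword(row))
```

Returning an empty list rather than None keeps the result compatible with an ArrayType(StringType()) return type.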

Sparse vector format

Indices
Values

Example:

Array: [1.0, 0.0, 0.0, 3.0]
Sparse vector: (4, [0, 3], [1.0, 3.0])
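The example reads as (size, indices, values): the array has 4 entries, and the nonzero values 1.0 and 3.0 sit at positions 0 and 3. A minimal pure-Python sketch of the conversion in both directions follows; to_sparse and to_dense are hypothetical helper names, not Spark functions.

```python
# Convert between a dense list and the (size, indices, values) triple.
# to_sparse and to_dense are hypothetical helpers for illustration.
def to_sparse(dense):
    """Keep only the positions and values of the nonzero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return (len(dense), indices, values)


def to_dense(sparse):
    """Rebuild the full list, filling unlisted positions with 0.0."""
    size, indices, values = sparse
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


sparse = to_sparse([1.0, 0.0, 0.0, 3.0])
print(sparse)
print(to_dense(sparse))
```

The size is stored explicitly because trailing zeros cannot be recovered from the indices and values alone.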

Working with vector data

hasattr(x, "toArray")
x.numNonzeros()
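These two calls are typically combined as a guard: check that the value looks like a vector before asking it for its nonzero count. The sketch below shows that pattern in plain Python. TinyVector is a hypothetical stand-in that exposes only the two methods named above, so the example runs without Spark; returning 0 for non-vector input is an assumption, not something the slides specify.

```python
# Duck-typing guard around vector values.
# TinyVector is a hypothetical stand-in for a Spark vector; it implements
# only toArray() and numNonzeros().
class TinyVector:
    def __init__(self, values):
        self._values = list(values)

    def toArray(self):
        return list(self._values)

    def numNonzeros(self):
        return sum(1 for v in self._values if v != 0.0)


def count_nonzeros(x):
    """Return the nonzero count for vector-like input, else 0 (assumed default)."""
    if x is not None and hasattr(x, "toArray"):
        return int(x.numNonzeros())
    return 0


print(count_nonzeros(TinyVector([1.0, 0.0, 0.0, 3.0])))
print(count_nonzeros(None))
```

A function like count_nonzeros could then be wrapped with udf and IntegerType() in the same way as the earlier examples.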

Let's practice!
