Extract Transform Select

Introduction to Spark SQL in Python

Mark Plutowski

Data Scientist

ETS


Extract, Transform, and Select

  • Extraction
  • Transformation
  • Selection

Built-in functions

from pyspark.sql.functions import split, explode

The length function

from pyspark.sql.functions import length
df.where(length('sentence') == 0)

Creating a custom function

  • User Defined Function
  • UDF

Importing the udf function

from pyspark.sql.functions import udf

Creating a boolean UDF

print(df)
DataFrame[textdata: string]
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

Creating a boolean UDF

short_udf = udf(lambda x: True if not x or len(x) < 10 else False,
                BooleanType())
df.select(short_udf('textdata')\
  .alias("is short"))\
  .show(3)
+--------+
|is short|
+--------+
|   false|
|    true|
|   false|
+--------+

Important UDF return types

from pyspark.sql.types import StringType, IntegerType, FloatType, ArrayType

Creating an array UDF

df3.select('word array', in_udf('word array').alias('without endword'))\
   .show(5, truncate=30)
+-----------------------------+----------------------+
|                   word array|       without endword|
+-----------------------------+----------------------+
|[then, how, many, are, there]|[then, how, many, are]|
|                  [how, many]|                 [how]|
|             [i, donot, know]|            [i, donot]|
|                  [quite, so]|               [quite]|
|   [you, have, not, observed]|      [you, have, not]|
+-----------------------------+----------------------+

Creating an array UDF

from pyspark.sql.types import StringType, ArrayType
# Removes last item in array
in_udf = udf(lambda x: x[:-1] if x else [],
             ArrayType(StringType()))

Sparse vector format

  1. Size
  2. Indices
  3. Values

Example:

  • Array: [1.0, 0.0, 0.0, 3.0]
  • Sparse vector: (4, [0, 3], [1.0, 3.0])

Working with vector data

  • hasattr(x, "toArray")
  • x.numNonzeros()

Let's practice!
