Extract Transform Select

Introduction to Spark SQL in Python

Mark Plutowski

Data Scientist

ETS


Extract, Transform, and Select

  • Extraction
  • Transformation
  • Selection

Built-in functions

from pyspark.sql.functions import split, explode

The length function

from pyspark.sql.functions import length
df.where(length('sentence') == 0)

Creating a custom function

  • User Defined Function
  • UDF

Importing the udf function

from pyspark.sql.functions import udf

Creating a boolean UDF

print(df)
DataFrame[textdata: string]
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

Creating a boolean UDF

short_udf = udf(lambda x: True if not x or len(x) < 10 else False,
                BooleanType())
df.select(short_udf('textdata')\
  .alias("is short"))\
  .show(3)
+--------+
|is short|
+--------+
|   false|
|    true|
|   false|
+--------+

Important UDF return types

from pyspark.sql.types import StringType, IntegerType, FloatType, ArrayType

Creating an array UDF

df3.select('word array', in_udf('word array').alias('without endword'))\
   .show(5, truncate=30)
+-----------------------------+----------------------+
|                   word array|       without endword|
+-----------------------------+----------------------+
|[then, how, many, are, there]|[then, how, many, are]|
|                  [how, many]|                 [how]|
|             [i, donot, know]|            [i, donot]|
|                  [quite, so]|               [quite]|
|   [you, have, not, observed]|      [you, have, not]|
+-----------------------------+----------------------+

Creating an array UDF

from pyspark.sql.types import StringType, ArrayType
# Removes last item in array
in_udf = udf(lambda x: x[:-1] if x else [],
             ArrayType(StringType()))

Sparse vector format

  1. Size
  2. Indices
  3. Values

Example:

  • Array: [1.0, 0.0, 0.0, 3.0]
  • Sparse vector: (4, [0, 3], [1.0, 3.0])

Working with vector data

  • hasattr(x, "toArray")
  • x.numNonzeros()

Let's practice!
