Extracting Features

Feature Engineering with PySpark

John Hogue

Lead Data Scientist, General Mills

Extracting Age with Text Match

ROOF
Asphalt Shingles, Pitched, Age 8 Years or Less
Asphalt Shingles, Age Over 8 Years
Asphalt Shingles, Age 8 Years or Less
Asphalt Shingles

ROOF becomes old_roof
Age 8 Years or Less → 0
Age Over 8 Years → 1
Age 8 Years or Less → 0
NULL → NULL

Extracting Age with Text Match

from pyspark.sql.functions import when

# Create boolean filters
find_under_8 = df['ROOF'].like('%Age 8 Years or Less%')
find_over_8 = df['ROOF'].like('%Age Over 8 Years%')

# Apply filters using when() and otherwise()
df = df.withColumn('old_roof', (when(find_over_8, 1)
                                .when(find_under_8, 0)
                                .otherwise(None)))

# Inspect results
df[['ROOF', 'old_roof']].show(3, truncate=100)
+----------------------------------------------+--------+
|                                          ROOF|old_roof|
+----------------------------------------------+--------+
|                                          null|    null|
|Asphalt Shingles, Pitched, Age 8 Years or Less|       0|
|            Asphalt Shingles, Age Over 8 Years|       1|
+----------------------------------------------+--------+
only showing top 3 rows
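The precedence of the chained when() calls can be sketched in plain Python (an analogy, not Spark code): the first matching condition wins, and rows that match neither pattern, including NULLs, fall through to None:

```python
# Plain-Python analogy of the chained when()/otherwise() logic.
# Each row's ROOF text is checked against the same substrings
# used by the like('%...%') filters above.
def old_roof_flag(roof):
    if roof is not None and 'Age Over 8 Years' in roof:
        return 1   # first when(): over 8 years -> 1
    if roof is not None and 'Age 8 Years or Less' in roof:
        return 0   # second when(): 8 years or less -> 0
    return None    # otherwise(): no match (or NULL) -> None

rows = [None,
        'Asphalt Shingles, Pitched, Age 8 Years or Less',
        'Asphalt Shingles, Age Over 8 Years']
print([old_roof_flag(r) for r in rows])  # [None, 0, 1]
```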
Splitting Columns

ROOF becomes Roof_Material
Asphalt Shingles, Pitched, Age 8 Years or Less → Asphalt Shingles
Null → Null
Asphalt Shingles, Age Over 8 Years → Asphalt Shingles
Metal, Age 8 Years or Less → Metal
Tile, Age 8 Years or Less → Tile
Asphalt Shingles → Asphalt Shingles
Splitting Columns

from pyspark.sql.functions import split

# Split the column on commas into a list
split_col = split(df['ROOF'], ',')

# Put the first value of the list into a new column
df = df.withColumn('Roof_Material', split_col.getItem(0))

# Inspect results
df[['ROOF', 'Roof_Material']].show(5, truncate=100)
+----------------------------------------------+----------------+
|                                          ROOF|   Roof_Material|
+----------------------------------------------+----------------+
|                                          null|            null|
|Asphalt Shingles, Pitched, Age 8 Years or Less|Asphalt Shingles|
|                                          null|            null|
|Asphalt Shingles, Pitched, Age 8 Years or Less|Asphalt Shingles|
|            Asphalt Shingles, Age Over 8 Years|Asphalt Shingles|
+----------------------------------------------+----------------+
only showing top 5 rows
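getItem(0) simply indexes into the list that split() produces; a plain-Python sketch of the same idea (an analogy, not Spark code):

```python
# Plain-Python analogy of split() + getItem(0): split the ROOF
# text on commas and keep the first piece as the material.
def roof_material(roof):
    if roof is None:
        return None  # NULL stays NULL, as in the Spark output
    return roof.split(',')[0]

print(roof_material('Asphalt Shingles, Pitched, Age 8 Years or Less'))
# Asphalt Shingles
print(roof_material(None))  # None
```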
Explode!

Starting Record

NO | roof_list
2  | [Asphalt Shingles, Pitched, Age 8 Years or Less]

Exploded Record

NO | ex_roof_list
2  | Asphalt Shingles
2  | Pitched
2  | Age 8 Years or Less
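explode() turns one row holding a list into one row per list element, copying the other columns alongside. A plain-Python sketch of that reshaping (an analogy, not Spark code):

```python
# Plain-Python analogy of explode(): one (NO, roof_list) row becomes
# one (NO, ex_roof_list) row per element of the list.
def explode_rows(records):
    out = []
    for no, roof_list in records:
        for item in roof_list:
            out.append((no, item))
    return out

start = [(2, ['Asphalt Shingles', 'Pitched', 'Age 8 Years or Less'])]
print(explode_rows(start))
# [(2, 'Asphalt Shingles'), (2, 'Pitched'), (2, 'Age 8 Years or Less')]
```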
Pivot!

Exploded Record

NO | ex_roof_list
2  | Asphalt Shingles
2  | Pitched
2  | Age 8 Years or Less

Pivoted Record

NO | Age 8 Years or Less | Age Over 8 Years | Asphalt Shingles | Flat | Metal | Other | Pitched | ...
2  | 0                   | 1                | 1                | 0    | 0     | 0    | 1       | ...
Explode & Pivot!

from pyspark.sql.functions import split, explode, lit, coalesce, first
# Split the column on commas into a list
df = df.withColumn('roof_list', split(df['ROOF'], ', '))
# Explode list into new records for each value
ex_df = df.withColumn('ex_roof_list', explode(df['roof_list']))
# Create a dummy column of constant value
ex_df = ex_df.withColumn('constant_val', lit(1))
# Pivot the values into boolean columns
piv_df = ex_df.groupBy('NO').pivot('ex_roof_list')\
  .agg(coalesce(first('constant_val')))
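The groupBy().pivot().agg() step collects each record's exploded values into one column per distinct value seen anywhere in the data. A plain-Python sketch of that pivot (an analogy, not Spark code); note that in Spark the unmatched cells come back as null, so a fillna(0) is commonly applied afterwards, which this sketch folds in by writing 0 directly:

```python
# Plain-Python analogy of groupBy('NO').pivot('ex_roof_list'):
# collect the distinct exploded values as columns and mark each
# record with 1 where the value occurs (0 elsewhere, i.e. fillna(0)).
def pivot_rows(exploded):
    columns = sorted({val for _, val in exploded})
    by_no = {}
    for no, val in exploded:
        by_no.setdefault(no, set()).add(val)
    return {no: {col: int(col in vals) for col in columns}
            for no, vals in by_no.items()}

exploded = [(2, 'Asphalt Shingles'), (2, 'Pitched'),
            (2, 'Age 8 Years or Less')]
print(pivot_rows(exploded))
```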
Let's wrangle some features!
