Feature Engineering with PySpark
John Hogue
Lead Data Scientist, General Mills
ROOF |
---|
Asphalt Shingles, Pitched, Age 8 Years or Less |
Asphalt Shingles, Age Over 8 Years |
Asphalt Shingles, Age 8 Years or Less |
Asphalt Shingles |
Roof_Age | becomes | Roof>8yrs |
---|---|---|
Age 8 Years or Less | → | 0 |
Age Over 8 Years | → | 1 |
Age 8 Years or Less | → | 0 |
NULL | → | NULL |
from pyspark.sql.functions import when

# Create boolean filters
find_under_8 = df['ROOF'].like('%Age 8 Years or Less%')
find_over_8 = df['ROOF'].like('%Age Over 8 Years%')

# Apply filters using when() and otherwise()
df = df.withColumn('old_roof', (when(find_over_8, 1)
                                .when(find_under_8, 0)
                                .otherwise(None)))

# Inspect results
df[['ROOF', 'old_roof']].show(3, truncate=100)
+----------------------------------------------+--------+
| ROOF|old_roof|
+----------------------------------------------+--------+
| null| null|
|Asphalt Shingles, Pitched, Age 8 Years or Less| 0|
| Asphalt Shingles, Age Over 8 Years| 1|
+----------------------------------------------+--------+
only showing top 3 rows
ROOF | becomes | Roof_Material |
---|---|---|
Asphalt Shingles, Pitched, Age 8 Years or Less | → | Asphalt Shingles |
NULL | → | NULL |
Asphalt Shingles, Age Over 8 Years | → | Asphalt Shingles |
Metal, Age 8 Years or Less | → | Metal |
Tile, Age 8 Years or Less | → | Tile |
Asphalt Shingles | → | Asphalt Shingles |
from pyspark.sql.functions import split

# Split the column on commas into a list
split_col = split(df['ROOF'], ',')

# Put the first value of the list into a new column
df = df.withColumn('Roof_Material', split_col.getItem(0))

# Inspect results
df[['ROOF', 'Roof_Material']].show(5, truncate=100)
+----------------------------------------------+----------------+
| ROOF| Roof_Material|
+----------------------------------------------+----------------+
| null| null|
|Asphalt Shingles, Pitched, Age 8 Years or Less|Asphalt Shingles|
| null| null|
|Asphalt Shingles, Pitched, Age 8 Years or Less|Asphalt Shingles|
| Asphalt Shingles, Age Over 8 Years|Asphalt Shingles|
+----------------------------------------------+----------------+
only showing top 5 rows
Starting Record
NO | roof_list |
---|---|
2 | [Asphalt Shingles, Pitched, Age 8 Years or Less] |
Exploded Record
NO | ex_roof_list |
---|---|
2 | Asphalt Shingles |
2 | Pitched |
2 | Age 8 Years or Less |
Pivoted Record
NO | Age 8 Years or Less | Age Over 8 Years | Asphalt Shingles | Flat | Metal | Other | Pitched | ... |
---|---|---|---|---|---|---|---|---|
2 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | ... |
from pyspark.sql.functions import split, explode, lit, coalesce, first
# Split the column on commas into a list
df = df.withColumn('roof_list', split(df['ROOF'], ', '))
# Explode list into new records for each value
ex_df = df.withColumn('ex_roof_list', explode(df['roof_list']))
# Create a dummy column of constant value
ex_df = ex_df.withColumn('constant_val', lit(1))
# Pivot the values into boolean columns
piv_df = ex_df.groupBy('NO').pivot('ex_roof_list')\
.agg(coalesce(first('constant_val')))