Feature Engineering with PySpark
John Hogue
Lead Data Scientist, General Mills
ROOF |
---|
Asphalt Shingles, Pitched, Age 8 Years or Less |
Asphalt Shingles, Age Over 8 Years |
Asphalt Shingles, Age 8 Years or Less |
Asphalt Shingles |
Roof_Age | becomes | Roof>8yrs |
---|---|---|
Age 8 Years or Less | → | 0 |
Age Over 8 Years | → | 1 |
Age 8 Years or Less | → | 0 |
NULL | → | NULL |
from pyspark.sql.functions import when

# Create boolean filters
find_under_8 = df['ROOF'].like('%Age 8 Years or Less%')
find_over_8 = df['ROOF'].like('%Age Over 8 Years%')

# Apply filters using when() and otherwise()
df = df.withColumn('old_roof', (when(find_over_8, 1)
                                .when(find_under_8, 0)
                                .otherwise(None)))

# Inspect results
df[['ROOF', 'old_roof']].show(3, truncate=100)
+----------------------------------------------+--------+
| ROOF|old_roof|
+----------------------------------------------+--------+
| null| null|
|Asphalt Shingles, Pitched, Age 8 Years or Less| 0|
| Asphalt Shingles, Age Over 8 Years| 1|
+----------------------------------------------+--------+
only showing top 3 rows
ROOF | becomes | Roof_Material |
---|---|---|
Asphalt Shingles, Pitched, Age 8 Years or Less | → | Asphalt Shingles |
NULL | → | NULL |
Asphalt Shingles, Age Over 8 Years | → | Asphalt Shingles |
Metal, Age 8 Years or Less | → | Metal |
Tile, Age 8 Years or Less | → | Tile |
Asphalt Shingles | → | Asphalt Shingles |
from pyspark.sql.functions import split

# Split the column on commas into a list
split_col = split(df['ROOF'], ',')

# Put the first value of the list into a new column
df = df.withColumn('Roof_Material', split_col.getItem(0))

# Inspect results
df[['ROOF', 'Roof_Material']].show(5, truncate=100)
+----------------------------------------------+----------------+
| ROOF| Roof_Material|
+----------------------------------------------+----------------+
| null| null|
|Asphalt Shingles, Pitched, Age 8 Years or Less|Asphalt Shingles|
| null| null|
|Asphalt Shingles, Pitched, Age 8 Years or Less|Asphalt Shingles|
| Asphalt Shingles, Age Over 8 Years|Asphalt Shingles|
+----------------------------------------------+----------------+
only showing top 5 rows
Starting Record
NO | roof_list |
---|---|
2 | [Asphalt Shingles, Pitched, Age 8 Years or Less] |
Exploded Record
NO | ex_roof_list |
---|---|
2 | Asphalt Shingles |
2 | Pitched |
2 | Age 8 Years or Less |
Pivoted Record
NO | Age 8 Years or Less | Age Over 8 Years | Asphalt Shingles | Flat | Metal | Other | Pitched | ... |
---|---|---|---|---|---|---|---|---|
2 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | ... |
from pyspark.sql.functions import split, explode, lit, coalesce, first
# Split the column on commas into a list
df = df.withColumn('roof_list', split(df['ROOF'], ', '))
# Explode list into new records for each value
ex_df = df.withColumn('ex_roof_list', explode(df['roof_list']))
# Create a dummy column of constant value
ex_df = ex_df.withColumn('constant_val', lit(1))
# Pivot the values into boolean columns
piv_df = ex_df.groupBy('NO').pivot('ex_roof_list')\
.agg(coalesce(first('constant_val')))