Feature Generation

Feature Engineering with PySpark

John Hogue

Lead Data Scientist

Why generate new features?

Generation

Multiplying

Summing

Differencing

Dividing

Why generate new features?

Length Linear Model Plot

Width Linear Model Plot

Combining Two Features

Multiplication

# Create a new feature, area, by multiplying width by length
df = df.withColumn('TSQFT', (df['WIDTH'] * df['LENGTH']))

Area Linear Model Plot

Other Ways to Combine Two Features

# datediff must be imported before it can be used
from pyspark.sql.functions import datediff

# Sum two columns
df = df.withColumn('TSQFT', (df['SQFTBELOWGROUND'] + df['SQFTABOVEGROUND']))
# Divide two columns
df = df.withColumn('PRICEPERTSQFT', (df['LISTPRICE'] / df['TSQFT']))
# Difference two columns (days between the two dates)
df = df.withColumn('DAYSONMARKET', datediff('OFFMARKETDATE', 'LISTDATE'))

What's the limit?

Automation of Features

  • FeatureTools & TSFresh
  • Explosion of Features
  • Higher Order & Beyond?
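
A rough way to see the "explosion of features" is to count how many pairwise features the four operations from earlier would generate. The sketch below is an illustrative assumption, not how FeatureTools or TSFresh count features: it treats each unordered pair of columns as producing one feature per operation, and the helper name pairwise_feature_count is hypothetical.

```python
from math import comb

# The four ways of combining two features named earlier in the slides.
OPERATIONS = ("multiply", "sum", "difference", "divide")


def pairwise_feature_count(n_features, n_operations=len(OPERATIONS)):
    """Count features from applying each operation once to every unordered pair.

    Simplification: ordered variants (a/b vs. b/a) are not counted separately.
    """
    return comb(n_features, 2) * n_operations


for n in (10, 50, 100):
    print(n, "input features ->", pairwise_feature_count(n), "generated features")
```

Even under this conservative count the growth is quadratic in the number of input columns, and combining the generated features again (higher order) compounds it further.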

Go forth and combine!
