Feature Engineering with PySpark
John Hogue
Lead Data Scientist
Multiplying
Summing
Differencing
Dividing
Multiplication
# Create a new feature, area, by multiplying width and length
df = df.withColumn('TSQFT', (df['WIDTH'] * df['LENGTH']))
# Sum two columns
df = df.withColumn('TSQFT', (df['SQFTBELOWGROUND'] + df['SQFTABOVEGROUND']))
# Divide two columns
df = df.withColumn('PRICEPERTSQFT', (df['LISTPRICE'] / df['TSQFT']))
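# Difference two date columns (datediff must be imported; it returns end minus start in days)
from pyspark.sql.functions import datediff
df = df.withColumn('DAYSONMARKET', datediff('OFFMARKETDATE', 'LISTDATE'))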
Automation of Features