Preparing for Random Forest Regression

Feature Engineering with PySpark

John Hogue

Lead Data Scientist, General Mills

Assumptions Needed for Features

Random Forest Regression

  • Skewed/Non Normal Data? OK
  • Unscaled? OK
  • Missing Data? OK
  • Categorical Data? OK

Assumptions


Appended Features

Economic

  • 30 Year Mortgage Rates

Governmental

  • Median Home Price for City
  • Home Age Percentages for City
  • Home Size Percentages for City

Social

  • Walk Score
  • Bike Score

Seasonal

  • Bank Holidays

Engineered Features

Temporal Features

  • Limited value with one year of data
  • Holiday Weeks

Rates, Ratios, Sums

  • Business Context
  • Personal Context

Expanded Features

  • Non-Free Form Text Columns
  • Need to Remove Low Observations
# What is the shape of our data?
print((df.count(), len(df.columns)))
(5000, 126)

Dataframe Columns to Feature Vectors

from pyspark.ml.feature import VectorAssembler
# Replace Missing values
df = df.fillna(-1)
# Define the columns to be converted to vectors
features_cols = list(df.columns)
# Remove the dependent variable from the list
features_cols.remove('SALESCLOSEPRICE')

# Create the vector assembler transformer
vec = VectorAssembler(inputCols=features_cols, outputCol='features')

# Apply the vector transformer to data
df = vec.transform(df)
# Select only the feature vectors and the dependent variable
ml_ready_df = df.select(['SALESCLOSEPRICE', 'features'])
# Inspect Results
ml_ready_df.show(5)
+----------------+--------------------+
| SALESCLOSEPRICE|            features|
+----------------+--------------------+
|143000          |(125,[0,1,2,3,5,6...|
|190000          |(125,[0,1,2,3,5,6...|
|225000          |(125,[0,1,2,3,5,6...|
|265000          |(125,[0,1,2,3,4,5...|
|249900          |(125,[0,1,2,3,4,5...|
+----------------+--------------------+
only showing top 5 rows

We are now ready for machine learning!
