Feature Engineering with PySpark
John Hogue
Lead Data Scientist, General Mills
Random Forest Regression
Economic
Governmental
Social
Seasonal
Temporal Features
Rates, Ratios, Sums
Expanded Features
# What is shape of our data?
print((df.count(), len(df.columns)))
(5000, 126)
from pyspark.ml.feature import VectorAssembler
# Replace missing values with a sentinel; tree-based models can split on it
df = df.fillna(-1)
# Define the columns to be converted to vectors
features_cols = list(df.columns)
# Remove the dependent variable from the list
features_cols.remove('SALESCLOSEPRICE')
# Create the vector assembler transformer
vec = VectorAssembler(inputCols=features_cols, outputCol='features')
# Apply the vector transformer to the data
df = vec.transform(df)
# Select only the feature vectors and the dependent variable
ml_ready_df = df.select(['SALESCLOSEPRICE', 'features'])
# Inspect results
ml_ready_df.show(5)
+----------------+--------------------+
| SALESCLOSEPRICE| features|
+----------------+--------------------+
|143000 |(125,[0,1,2,3,5,6...|
|190000 |(125,[0,1,2,3,5,6...|
|225000 |(125,[0,1,2,3,5,6...|
|265000 |(125,[0,1,2,3,4,5...|
|249900 |(125,[0,1,2,3,4,5...|
+----------------+--------------------+
only showing top 5 rows