Feature Engineering with PySpark
John Hogue
Lead Data Scientist, General Mills
Random Forest Regression
Economic
Governmental
Social
Seasonal
Temporal Features
Rates, Ratios, Sums
Expanded Features
# What is shape of our data?
print((df.count(), len(df.columns)))
(5000, 126)
from pyspark.ml.feature import VectorAssembler
# Replace missing values with a sentinel; tree-based models can split on it
df = df.fillna(-1)
# Define the columns to be converted to vectors
features_cols = list(df.columns)
# Remove the dependent variable from the list
features_cols.remove('SALESCLOSEPRICE')
# Create the vector assembler transformer
vec = VectorAssembler(inputCols=features_cols, outputCol='features')
# Apply the vector transformer to the data
df = vec.transform(df)
# Select only the feature vectors and the dependent variable
ml_ready_df = df.select(['SALESCLOSEPRICE', 'features'])
# Inspect results
ml_ready_df.show(5)
+----------------+--------------------+
| SALESCLOSEPRICE| features|
+----------------+--------------------+
|143000 |(125,[0,1,2,3,5,6...|
|190000 |(125,[0,1,2,3,5,6...|
|225000 |(125,[0,1,2,3,5,6...|
|265000 |(125,[0,1,2,3,4,5...|
|249900 |(125,[0,1,2,3,4,5...|
+----------------+--------------------+
only showing top 5 rows