Building a Model

Feature Engineering with PySpark

John Hogue

Lead Data Scientist, General Mills

RandomForestRegressor

Basic Model Parameters

  • featuresCol="features"
  • labelCol="label"
  • predictionCol="prediction"
  • seed=None
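
These are PySpark's defaults: with no arguments, RandomForestRegressor expects the feature vector in a column named "features" and the label in "label", writes predictions to "prediction", and picks a new random seed each run. A minimal sketch (assuming an active SparkSession) to inspect them:

from pyspark.ml.regression import RandomForestRegressor
# With no arguments, every parameter takes its default value
rf_default = RandomForestRegressor()
# explainParams() lists each parameter with its doc string and current value
print(rf_default.explainParams())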

Our Model Parameter Values

  • featuresCol="features"
  • labelCol="SALESCLOSEPRICE"
  • predictionCol="Prediction_Price"
  • seed=42

Training a Random Forest

from pyspark.ml.regression import RandomForestRegressor
# Initialize model with columns to utilize
rf = RandomForestRegressor(featuresCol="features",
                           labelCol="SALESCLOSEPRICE",
                           predictionCol="Prediction_Price",
                           seed=42
                           )
# Train model
model = rf.fit(train_df)
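
Here train_df is assumed to already exist as the training split of the data. A minimal sketch of one way to create it with DataFrame.randomSplit (the split ratio is an assumption, not necessarily what the course uses):

# Hypothetical 80/20 split of the full dataset df into train and test sets;
# fixing the seed makes the split reproducible
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)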

Predicting with a Model

# Make predictions
predictions = model.transform(test_df)
# Inspect results
predictions.select("Prediction_Price", "SALESCLOSEPRICE").show(5)
+------------------+---------------+
|  Prediction_Price|SALESCLOSEPRICE|
+------------------+---------------+
|426029.55463222397|         415000|
| 708510.8806005502|         842500|
| 164275.7116183204|         161000|
| 208943.4143642175|         200000|
|217152.43272221283|         205000|
+------------------+---------------+
only showing top 5 rows
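
Note that transform() returns a new DataFrame with the Prediction_Price column appended; Spark DataFrames are immutable, so test_df itself is unchanged.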

Evaluating a Model

from pyspark.ml.evaluation import RegressionEvaluator
# Select columns to compute test error
evaluator = RegressionEvaluator(labelCol="SALESCLOSEPRICE", 
                                predictionCol="Prediction_Price")
# Create evaluation metrics
rmse = evaluator.evaluate(predictions, {evaluator.metricName: "rmse"})
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})
# Print Model Metrics
print('RMSE: ' + str(rmse))
print('R^2: ' + str(r2))
RMSE: 22898.84041072095
R^2: 0.9666594402208077
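
RMSE shares the units of the label, so a typical prediction error here is roughly $23,000, while an R^2 of about 0.97 means the model explains about 97% of the variance in SALESCLOSEPRICE.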

Let's model some data!
