Feature Engineering with PySpark
John Hogue
Lead Data Scientist, General Mills
Basic Model Parameters
A RandomForestRegressor needs to know which column holds the feature vector, which column is the label, where to write its predictions, and which random seed to use. Its defaults are:
featuresCol="features"
labelCol="label"
predictionCol="prediction"
seed=None
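These are the values PySpark falls back to if you construct the estimator without arguments. As a quick sketch (not from the original slides), you can list every parameter and its current value with explainParams():
from pyspark.ml.regression import RandomForestRegressor
# Constructing with no arguments keeps the defaults listed above
rf_default = RandomForestRegressor()
# explainParams() returns each parameter with its documentation and value
print(rf_default.explainParams())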
Our Model Parameter values
For the house price model we point the estimator at our own columns and fix the seed so results are reproducible:
featuresCol="features"
labelCol="SALESCLOSEPRICE"
predictionCol="Prediction_Price"
seed=42
from pyspark.ml.regression import RandomForestRegressor
# Initialize model with columns to utilize
rf = RandomForestRegressor(featuresCol="features",
                           labelCol="SALESCLOSEPRICE",
                           predictionCol="Prediction_Price",
                           seed=42)
# Train model
model = rf.fit(train_df)
# Make predictions
predictions = model.transform(test_df)
# Inspect results
predictions.select("Prediction_Price", "SALESCLOSEPRICE").show(5)
+------------------+---------------+
| Prediction_Price|SALESCLOSEPRICE|
+------------------+---------------+
|426029.55463222397| 415000|
| 708510.8806005502| 842500|
| 164275.7116183204| 161000|
| 208943.4143642175| 200000|
|217152.43272221283| 205000|
+------------------+---------------+
only showing top 5 rows
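Predictions land in an ordinary DataFrame, so an optional sanity check (illustrative, not in the original slides) is to compute a per-row absolute error before formal evaluation:
from pyspark.sql.functions import col, abs as abs_
# Hypothetical abs_error column comparing predicted and actual prices
checked = predictions.withColumn(
    "abs_error", abs_(col("Prediction_Price") - col("SALESCLOSEPRICE")))
checked.select("Prediction_Price", "SALESCLOSEPRICE", "abs_error").show(5)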
from pyspark.ml.evaluation import RegressionEvaluator
# Select columns to compute test error
evaluator = RegressionEvaluator(labelCol="SALESCLOSEPRICE",
                                predictionCol="Prediction_Price")
# Create evaluation metrics
rmse = evaluator.evaluate(predictions, {evaluator.metricName: "rmse"})
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})
# Print Model Metrics
print('RMSE: ' + str(rmse))
print('R^2: ' + str(r2))
RMSE: 22898.84041072095
R^2: 0.9666594402208077
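RMSE is expressed in the label's units, so the typical error here is roughly $23K on the sale price, and an R^2 near 0.97 means the model explains about 97% of the variance in SALESCLOSEPRICE. The same evaluator can report other metrics; for example (an illustrative extra, not in the original slides):
# Mean absolute error, reusing the evaluator defined above
mae = evaluator.evaluate(predictions, {evaluator.metricName: "mae"})
print('MAE: ' + str(mae))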