Feature Engineering with PySpark
John Hogue
Lead Data Scientist, General Mills
Basic Model Parameters
A RandomForestRegressor needs to know which column holds the feature vector, which column is the label, where to write its predictions, and which random seed to use. Its defaults are:
featuresCol="features"
labelCol="label"
predictionCol="prediction"
seed=None
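These are the values PySpark falls back to if you construct the estimator without arguments. As a quick sketch (not from the original slides), you can list every parameter and its current value with explainParams():
from pyspark.ml.regression import RandomForestRegressor
# Constructing with no arguments keeps the defaults listed above
rf_default = RandomForestRegressor()
# explainParams() returns each parameter with its documentation and value
print(rf_default.explainParams())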
Our Model Parameter values
For the house price model we point the estimator at our own columns and fix the seed so results are reproducible:
featuresCol="features"
labelCol="SALESCLOSEPRICE"
predictionCol="Prediction_Price"
seed=42
from pyspark.ml.regression import RandomForestRegressor
# Initialize model with columns to utilize
rf = RandomForestRegressor(featuresCol="features",
                           labelCol="SALESCLOSEPRICE",
                           predictionCol="Prediction_Price",
                           seed=42)
# Train model
model = rf.fit(train_df)
# Make predictions
predictions = model.transform(test_df)
# Inspect results
predictions.select("Prediction_Price", "SALESCLOSEPRICE").show(5)
+------------------+---------------+
| Prediction_Price|SALESCLOSEPRICE|
+------------------+---------------+
|426029.55463222397| 415000|
| 708510.8806005502| 842500|
| 164275.7116183204| 161000|
| 208943.4143642175| 200000|
|217152.43272221283| 205000|
+------------------+---------------+
only showing top 5 rows
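Predictions land in an ordinary DataFrame, so an optional sanity check (illustrative, not in the original slides) is to compute a per-row absolute error before formal evaluation:
from pyspark.sql.functions import col, abs as abs_
# Hypothetical abs_error column comparing predicted and actual prices
checked = predictions.withColumn(
    "abs_error", abs_(col("Prediction_Price") - col("SALESCLOSEPRICE")))
checked.select("Prediction_Price", "SALESCLOSEPRICE", "abs_error").show(5)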
from pyspark.ml.evaluation import RegressionEvaluator
# Select columns to compute test error
evaluator = RegressionEvaluator(labelCol="SALESCLOSEPRICE",
                                predictionCol="Prediction_Price")
# Create evaluation metrics
rmse = evaluator.evaluate(predictions, {evaluator.metricName: "rmse"})
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})
# Print Model Metrics
print('RMSE: ' + str(rmse))
print('R^2: ' + str(r2))
RMSE: 22898.84041072095
R^2: 0.9666594402208077
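RMSE is expressed in the label's units, so the typical error here is roughly $23K on the sale price, and an R^2 near 0.97 means the model explains about 97% of the variance in SALESCLOSEPRICE. The same evaluator can report other metrics; for example (an illustrative extra, not in the original slides):
# Mean absolute error, reusing the evaluator defined above
mae = evaluator.evaluate(predictions, {evaluator.metricName: "mae"})
print('MAE: ' + str(mae))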