Feature Engineering with PySpark
John Hogue
Lead Data Scientist, General Mills
import pandas as pd
# Convert feature importances to a pandas column
fi_df = pd.DataFrame(model.featureImportances.toArray(),
                     columns=['importance'])
# Convert list of feature names to pandas column
fi_df['feature'] = pd.Series(feature_cols)
# Sort the data based on feature importance
fi_df.sort_values(by=['importance'], ascending=False, inplace=True)
# Interpret results
fi_df.head(9)
| feature                 | importance |
|-------------------------|------------|
| LISTPRICE               | 0.312101   |
| ORIGINALLISTPRICE       | 0.202142   |
| LIVINGAREA              | 0.124239   |
| SQFT_TOTAL              | 0.081260   |
| LISTING_TO_MEDIAN_RATIO | 0.075086   |
| TAXES                   | 0.048452   |
| SQFTABOVEGROUND         | 0.045859   |
| BATHSTOTAL              | 0.034397   |
| LISTING_PRICE_PER_SQFT  | 0.018253   |
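One way to read the table above is to look at cumulative importance: Spark ML normalizes random forest feature importances to sum to 1, so a running total shows how much of the model's signal the top features carry. The sketch below is plain Python so it runs without a Spark session. The importance values are copied from the table; the 0.80 cutoff and the variable names are assumptions chosen for illustration, not part of the original workflow.

```python
from itertools import accumulate

# Importances copied from the table above, already sorted in descending order.
importances = [
    ("LISTPRICE", 0.312101),
    ("ORIGINALLISTPRICE", 0.202142),
    ("LIVINGAREA", 0.124239),
    ("SQFT_TOTAL", 0.081260),
    ("LISTING_TO_MEDIAN_RATIO", 0.075086),
    ("TAXES", 0.048452),
    ("SQFTABOVEGROUND", 0.045859),
    ("BATHSTOTAL", 0.034397),
    ("LISTING_PRICE_PER_SQFT", 0.018253),
]

# Hypothetical cutoff: keep features until they cover 80% of total importance.
THRESHOLD = 0.80

names = [name for name, _ in importances]
running_totals = list(accumulate(score for _, score in importances))

# Collect features in order until the running total reaches the cutoff.
selected = []
for name, total in zip(names, running_totals):
    selected.append(name)
    if total >= THRESHOLD:
        break

for name, total in zip(names, running_totals):
    print(f"{name:<25} {total:.6f}")
print("Selected:", selected)
```

With these values, the first six features (through TAXES) are enough to pass the 0.80 cutoff, and the nine features shown account for roughly 94% of total importance. A cutoff like this is only a rough screening heuristic; correlated features such as LISTPRICE and ORIGINALLISTPRICE can split importance between them.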
# Save model
model.save('rfr_real_estate_model')
from pyspark.ml.regression import RandomForestRegressionModel
# Load the saved model from disk
model2 = RandomForestRegressionModel.load('rfr_real_estate_model')