Machine Learning with PySpark
Andrew Collier
Data Scientist, Fathom Data
An ensemble is a collection of models.
Wisdom of the Crowd — the collective opinion of a group is better than that of a single expert.
Diversity and independence are important because the best collective decisions are the product of disagreement and contest, not consensus or compromise.
- James Surowiecki, The Wisdom of Crowds
Random Forest — an ensemble of Decision Trees
Creating model diversity:
each tree is trained on a random subset (bootstrap sample) of the training data
each tree considers a random subset of the features at each split.
As a result, no two trees in the forest should be the same.
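Both sources of randomness are exposed as parameters on RandomForestClassifier. A minimal sketch of the relevant arguments (the values shown here are the library defaults):
from pyspark.ml.classification import RandomForestClassifier
# subsamplingRate: fraction of rows (sampled with replacement) given to each tree
# featureSubsetStrategy: how many features each split is allowed to consider
RandomForestClassifier(subsamplingRate=1.0, featureSubsetStrategy="auto")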
Returning to the cars data: manufactured in the USA (0.0) or not (1.0).
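The examples below assume the data has already been split into training and testing sets. A minimal sketch of that setup (the names cars, cars_train and cars_test are assumptions):
# Assumed: cars is the assembled DataFrame with 'features' and 'label' columns.
cars_train, cars_test = cars.randomSplit([0.8, 0.2], seed=23)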
Create a Random Forest classifier.
from pyspark.ml.classification import RandomForestClassifier
forest = RandomForestClassifier(numTrees=5)
Fit to the training data.
forest = forest.fit(cars_train)
How to access trees within forest?
forest.trees
[DecisionTreeClassificationModel (uid=dtc_aa66702a4ce9) of depth 5 with 17 nodes,
DecisionTreeClassificationModel (uid=dtc_99f7efedafe9) of depth 5 with 31 nodes,
DecisionTreeClassificationModel (uid=dtc_9306e4a5fa1d) of depth 5 with 21 nodes,
DecisionTreeClassificationModel (uid=dtc_d643bd48a8dd) of depth 5 with 23 nodes,
DecisionTreeClassificationModel (uid=dtc_a2d5abd67969) of depth 5 with 27 nodes]
These can each be used to make individual predictions.
What predictions are generated by each tree?
+------+------+------+------+------+-----+
|tree 0|tree 1|tree 2|tree 3|tree 4|label|
+------+------+------+------+------+-----+
| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| <- perfect agreement
| 1.0| 1.0| 0.0| 1.0| 0.0| 0.0|
| 0.0| 0.0| 0.0| 1.0| 1.0| 1.0|
| 0.0| 0.0| 0.0| 1.0| 0.0| 0.0|
| 0.0| 1.0| 1.0| 1.0| 0.0| 1.0|
| 1.0| 1.0| 0.0| 1.0| 1.0| 1.0|
| 1.0| 1.0| 1.0| 1.0| 1.0| 1.0| <- perfect agreement
+------+------+------+------+------+-----+
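A sketch of how a table like this could be assembled, transforming the test data with each individual tree and joining the results (the DataFrame and column names are assumptions):
from pyspark.sql import functions as F
# Attach a row id so per-tree predictions can be joined back together.
test = cars_test.withColumn("id", F.monotonically_increasing_id())
preds = test.select("id", "label")
for i, tree in enumerate(forest.trees):
    tree_preds = tree.transform(test).select("id", F.col("prediction").alias(f"tree {i}"))
    preds = preds.join(tree_preds, on="id")
preds.drop("id").show()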
Use the .transform() method to generate consensus predictions.
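For example (cars_test assumed):
# Consensus predictions aggregate the votes across all trees in the forest.
predictions = forest.transform(cars_test)
predictions.select("label", "probability", "prediction").show(truncate=False)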
+-----+----------------------------------------+----------+
|label|probability |prediction|
+-----+----------------------------------------+----------+
|0.0 |[0.8,0.2] |0.0 |
|0.0 |[0.4,0.6] |1.0 |
|1.0 |[0.5333333333333333,0.4666666666666666] |0.0 |
|0.0 |[0.7177777777777778,0.28222222222222226]|0.0 |
|1.0 |[0.39396825396825397,0.606031746031746] |1.0 |
|1.0 |[0.17660818713450294,0.823391812865497] |1.0 |
|1.0 |[0.053968253968253964,0.946031746031746]|1.0 |
+-----+----------------------------------------+----------+
The model uses these features: cyl, size, mass, length, rpm and consumption.
Which of these is most or least important?
forest.featureImportances
SparseVector(6, {0: 0.0205, 1: 0.2701, 2: 0.108, 3: 0.1895, 4: 0.2939, 5: 0.1181})
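The vector is indexed in the same order as the features were assembled, so the scores can be paired with names. A sketch (the feature list is an assumption):
features = ["cyl", "size", "mass", "length", "rpm", "consumption"]
# Pair each feature name with its importance score.
for name, score in zip(features, forest.featureImportances.toArray()):
    print(f"{name}: {score:.4f}")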
Looks like:
rpm is most important
cyl is least important.
Gradient-Boosted Trees use an iterative boosting algorithm:
1. Build a Decision Tree and add it to the ensemble.
2. Predict the label for each training instance.
3. Compare the predictions with the known labels.
4. Build the next tree to focus on correcting the current errors.
5. Return to step 1.
The model improves on each iteration.
Create a Gradient-Boosted Tree classifier.
from pyspark.ml.classification import GBTClassifier
gbt = GBTClassifier(maxIter=10)  # maxIter = number of boosting iterations (trees)
Fit to the training data.
gbt = gbt.fit(cars_train)
Let's compare the three types of tree models on the testing data.
# AUC for Decision Tree
0.5875
# AUC for Random Forest
0.65
# AUC for Gradient-Boosted Tree
0.65
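A sketch of how these AUC values could be computed, assuming a fitted Decision Tree model named tree from earlier in the course (the evaluator's default metric is areaUnderROC):
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()  # metricName defaults to "areaUnderROC"
for name, model in [("Decision Tree", tree), ("Random Forest", forest), ("Gradient-Boosted Tree", gbt)]:
    auc = evaluator.evaluate(model.transform(cars_test))
    print(f"AUC for {name}: {auc:.4f}")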
Both of the ensemble methods perform better than a plain Decision Tree.