Machine Learning with PySpark
Andrew Collier
Data Scientist, Fathom Data
An ensemble is a collection of models.
Wisdom of the Crowd — the collective opinion of a group is better than that of a single expert.
Diversity and independence are important because the best collective decisions are the product of disagreement and contest, not consensus or compromise.
- James Surowiecki, The Wisdom of Crowds
Random Forest — an ensemble of Decision Trees
Creating model diversity:
each tree is trained on a random subset (bootstrap sample) of the training data
each tree considers a random subset of the features at each split.
As a result, no two trees in the forest should be the same.
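Both sources of randomness are exposed as parameters on RandomForestClassifier. A minimal sketch of the relevant arguments (the values shown here are the library defaults):
from pyspark.ml.classification import RandomForestClassifier
# subsamplingRate: fraction of rows (sampled with replacement) given to each tree
# featureSubsetStrategy: how many features each split is allowed to consider
RandomForestClassifier(subsamplingRate=1.0, featureSubsetStrategy="auto")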
Returning to the cars data: manufactured in the USA (0.0) or not (1.0).
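The examples below assume the data has already been split into training and testing sets. A minimal sketch of that setup (the names cars, cars_train and cars_test are assumptions):
# Assumed: cars is the assembled DataFrame with 'features' and 'label' columns.
cars_train, cars_test = cars.randomSplit([0.8, 0.2], seed=23)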
Create a Random Forest classifier.
from pyspark.ml.classification import RandomForestClassifier
forest = RandomForestClassifier(numTrees=5)
Fit to the training data.
forest = forest.fit(cars_train)
How to access trees within forest?
forest.trees
[DecisionTreeClassificationModel (uid=dtc_aa66702a4ce9) of depth 5 with 17 nodes,
DecisionTreeClassificationModel (uid=dtc_99f7efedafe9) of depth 5 with 31 nodes,
DecisionTreeClassificationModel (uid=dtc_9306e4a5fa1d) of depth 5 with 21 nodes,
DecisionTreeClassificationModel (uid=dtc_d643bd48a8dd) of depth 5 with 23 nodes,
DecisionTreeClassificationModel (uid=dtc_a2d5abd67969) of depth 5 with 27 nodes]
These can each be used to make individual predictions.
What predictions are generated by each tree?
+------+------+------+------+------+-----+
|tree 0|tree 1|tree 2|tree 3|tree 4|label|
+------+------+------+------+------+-----+
| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| <- perfect agreement
| 1.0| 1.0| 0.0| 1.0| 0.0| 0.0|
| 0.0| 0.0| 0.0| 1.0| 1.0| 1.0|
| 0.0| 0.0| 0.0| 1.0| 0.0| 0.0|
| 0.0| 1.0| 1.0| 1.0| 0.0| 1.0|
| 1.0| 1.0| 0.0| 1.0| 1.0| 1.0|
| 1.0| 1.0| 1.0| 1.0| 1.0| 1.0| <- perfect agreement
+------+------+------+------+------+-----+
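A sketch of how a table like this could be assembled, transforming the test data with each individual tree and joining the results (the DataFrame and column names are assumptions):
from pyspark.sql import functions as F
# Attach a row id so per-tree predictions can be joined back together.
test = cars_test.withColumn("id", F.monotonically_increasing_id())
preds = test.select("id", "label")
for i, tree in enumerate(forest.trees):
    tree_preds = tree.transform(test).select("id", F.col("prediction").alias(f"tree {i}"))
    preds = preds.join(tree_preds, on="id")
preds.drop("id").show()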
Use the .transform() method to generate consensus predictions.
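For example (cars_test assumed):
# Consensus predictions aggregate the votes across all trees in the forest.
predictions = forest.transform(cars_test)
predictions.select("label", "probability", "prediction").show(truncate=False)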
+-----+----------------------------------------+----------+
|label|probability |prediction|
+-----+----------------------------------------+----------+
|0.0 |[0.8,0.2] |0.0 |
|0.0 |[0.4,0.6] |1.0 |
|1.0 |[0.5333333333333333,0.4666666666666666] |0.0 |
|0.0 |[0.7177777777777778,0.28222222222222226]|0.0 |
|1.0 |[0.39396825396825397,0.606031746031746] |1.0 |
|1.0 |[0.17660818713450294,0.823391812865497] |1.0 |
|1.0 |[0.053968253968253964,0.946031746031746]|1.0 |
+-----+----------------------------------------+----------+
The model uses these features: cyl, size, mass, length, rpm and consumption.
Which of these is most or least important?
forest.featureImportances
SparseVector(6, {0: 0.0205, 1: 0.2701, 2: 0.108, 3: 0.1895, 4: 0.2939, 5: 0.1181})
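The vector is indexed in the same order as the features were assembled, so the scores can be paired with names. A sketch (the feature list is an assumption):
features = ["cyl", "size", "mass", "length", "rpm", "consumption"]
# Pair each feature name with its importance score.
for name, score in zip(features, forest.featureImportances.toArray()):
    print(f"{name}: {score:.4f}")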
Looks like:
rpm is most important
cyl is least important.
Gradient-Boosted Trees use an iterative boosting algorithm:
1. Build a Decision Tree and add it to the ensemble.
2. Predict the label for each training instance.
3. Compare the predictions with the known labels.
4. Build the next tree to focus on correcting the current errors.
5. Return to step 1.
The model improves on each iteration.
Create a Gradient-Boosted Tree classifier.
from pyspark.ml.classification import GBTClassifier
gbt = GBTClassifier(maxIter=10)  # maxIter = number of boosting iterations (trees)
Fit to the training data.
gbt = gbt.fit(cars_train)
Let's compare the three types of tree models on the testing data.
# AUC for Decision Tree
0.5875
# AUC for Random Forest
0.65
# AUC for Gradient-Boosted Tree
0.65
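A sketch of how these AUC values could be computed, assuming a fitted Decision Tree model named tree from earlier in the course (the evaluator's default metric is areaUnderROC):
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()  # metricName defaults to "areaUnderROC"
for name, model in [("Decision Tree", tree), ("Random Forest", forest), ("Gradient-Boosted Tree", gbt)]:
    auc = evaluator.evaluate(model.transform(cars_test))
    print(f"AUC for {name}: {auc:.4f}")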
Both of the ensemble methods perform better than a plain Decision Tree.