Ensemble

Machine Learning with PySpark

Andrew Collier

Data Scientist, Fathom Data

What's an ensemble?

It's a collection of models.

A collection of similar models

Wisdom of the Crowd — the collective opinion of a group is better than that of a single expert.


Ensemble diversity


Diversity and independence are important because the best collective decisions are the product of disagreement and contest, not consensus or compromise.

— James Surowiecki, The Wisdom of Crowds


Random Forest

Random Forest — an ensemble of Decision Trees

Creating model diversity:

  • each tree is trained on a random subset of the data
  • a random subset of features is considered for splitting at each node

No two trees in the forest should be the same.

A collection of trees


Create a forest of trees

Returning to the cars data: predicting whether each car was manufactured in the USA (0.0) or not (1.0).

Create a Random Forest classifier.

from pyspark.ml.classification import RandomForestClassifier

forest = RandomForestClassifier(numTrees=5)

Fit to the training data.

forest = forest.fit(cars_train)

Seeing the trees

How do we access the individual trees within the forest?

forest.trees
[DecisionTreeClassificationModel (uid=dtc_aa66702a4ce9) of depth 5 with 17 nodes,
 DecisionTreeClassificationModel (uid=dtc_99f7efedafe9) of depth 5 with 31 nodes,
 DecisionTreeClassificationModel (uid=dtc_9306e4a5fa1d) of depth 5 with 21 nodes,
 DecisionTreeClassificationModel (uid=dtc_d643bd48a8dd) of depth 5 with 23 nodes,
 DecisionTreeClassificationModel (uid=dtc_a2d5abd67969) of depth 5 with 27 nodes]

These can each be used to make individual predictions.


Predictions from individual trees

What predictions are generated by each tree?

+------+------+------+------+------+-----+
|tree 0|tree 1|tree 2|tree 3|tree 4|label|
+------+------+------+------+------+-----+
|   0.0|   0.0|   0.0|   0.0|   0.0|  0.0| <- perfect agreement
|   1.0|   1.0|   0.0|   1.0|   0.0|  0.0|
|   0.0|   0.0|   0.0|   1.0|   1.0|  1.0|
|   0.0|   0.0|   0.0|   1.0|   0.0|  0.0|
|   0.0|   1.0|   1.0|   1.0|   0.0|  1.0|
|   1.0|   1.0|   0.0|   1.0|   1.0|  1.0|
|   1.0|   1.0|   1.0|   1.0|   1.0|  1.0| <- perfect agreement
+------+------+------+------+------+-----+
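One way to combine the per-tree predictions above is a simple majority vote. Spark's forest actually averages the trees' probability vectors rather than counting hard votes, but the voting idea is easy to sketch in plain Python using the rows of the table:

```python
# Per-tree predictions for the seven test rows in the table above.
votes = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 1.0],
]

# Majority vote across the five trees for each row.
majority = [1.0 if sum(row) > len(row) / 2 else 0.0 for row in votes]
```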

Consensus predictions

Use the .transform() method to generate consensus predictions.

+-----+----------------------------------------+----------+
|label|probability                             |prediction|
+-----+----------------------------------------+----------+
|0.0  |[0.8,0.2]                               |0.0       |
|0.0  |[0.4,0.6]                               |1.0       |
|1.0  |[0.5333333333333333,0.4666666666666666] |0.0       |
|0.0  |[0.7177777777777778,0.28222222222222226]|0.0       |
|1.0  |[0.39396825396825397,0.606031746031746] |1.0       |
|1.0  |[0.17660818713450294,0.823391812865497] |1.0       |
|1.0  |[0.053968253968253964,0.946031746031746]|1.0       |
+-----+----------------------------------------+----------+
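Each row's consensus probability comes from averaging the individual trees' class-probability vectors. A toy illustration with made-up per-tree probabilities for a single test row:

```python
# Made-up probabilities of class 1.0 from five individual trees.
tree_probs = [0.2, 0.8, 0.7, 0.9, 0.4]

# Average the trees' probabilities, then predict the more likely class.
p1 = sum(tree_probs) / len(tree_probs)
consensus = [1 - p1, p1]
prediction = 1.0 if p1 > 0.5 else 0.0
```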

Feature importances

The model uses these features: cyl, size, mass, length, rpm and consumption.

Which of these is most or least important?

forest.featureImportances
SparseVector(6, {0: 0.0205, 1: 0.2701, 2: 0.108, 3: 0.1895, 4: 0.2939, 5: 0.1181})

Looks like:

  • rpm is most important
  • cyl is least important
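Pairing the scores with the feature names makes the ranking explicit; plain Python, using the values printed above:

```python
# Importance scores from forest.featureImportances, in feature order.
features = ["cyl", "size", "mass", "length", "rpm", "consumption"]
importances = [0.0205, 0.2701, 0.108, 0.1895, 0.2939, 0.1181]

# Sort features from most to least important.
ranked = sorted(zip(features, importances), key=lambda fi: fi[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score}")
```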

Gradient-Boosted Trees

Iterative boosting algorithm:

  1. Build a Decision Tree and add it to the ensemble.
  2. Predict the label for each training instance using the ensemble.
  3. Compare predictions with the known labels.
  4. Emphasize training instances with incorrect predictions.
  5. Return to step 1.

Model improves on each iteration.
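The loop can be sketched in plain Python. This toy version does regression with squared loss, where "emphasizing" the mistakes means fitting each new stump to the current residuals; Spark's GBTClassifier applies the same idea with a classification loss. All data and settings below are made up:

```python
# Toy boosting sketch: depth-1 "stumps" fitted to residuals (squared loss).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 3.2]
lr = 0.5               # learning rate (shrinkage)
pred = [0.0] * len(x)  # ensemble prediction, starts at zero

def fit_stump(xs, residuals):
    """Find the single split threshold minimising squared error on residuals."""
    best = None
    for t in xs:
        left = [r for xi, r in zip(xs, residuals) if xi <= t]
        right = [r for xi, r in zip(xs, residuals) if xi > t]
        lmean = sum(left) / len(left) if left else 0.0
        rmean = sum(right) / len(right) if right else 0.0
        err = sum((r - (lmean if xi <= t else rmean)) ** 2
                  for xi, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    return best[1], best[2], best[3]

for _ in range(10):
    residuals = [yi - pi for yi, pi in zip(y, pred)]  # steps 2-3: compare
    t, lmean, rmean = fit_stump(x, residuals)         # steps 1 & 4: fit errors
    pred = [pi + lr * (lmean if xi <= t else rmean)   # add tree to ensemble
            for xi, pi in zip(x, pred)]

# Error shrinks towards the within-group noise floor as trees are added.
mse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)
```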


Boosting trees

Create a Gradient-Boosted Tree classifier.

from pyspark.ml.classification import GBTClassifier

gbt = GBTClassifier(maxIter=10)

Fit to the training data.

gbt = gbt.fit(cars_train)

Comparing trees

Let's compare the three types of tree models on the testing data.

# AUC for Decision Tree
0.5875

# AUC for Random Forest
0.65

# AUC for Gradient-Boosted Tree
0.65

Both of the ensemble methods perform better than a plain Decision Tree.
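AUC measures how well the model's class-1 probabilities rank the positive examples above the negative ones. A plain-Python computation, using the forest probabilities (rounded) from the consensus table shown earlier:

```python
# Class-1 probabilities (rounded) and true labels for the seven test rows.
scores = [0.2, 0.6, 0.4667, 0.2822, 0.606, 0.8234, 0.946]
labels = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0]

# AUC = probability that a random positive outscores a random negative,
# counting ties as one half.
pos = [s for s, l in zip(scores, labels) if l == 1.0]
neg = [s for s, l in zip(scores, labels) if l == 0.0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
auc = sum(pairs) / len(pairs)
```

In Spark, the same metric comes from BinaryClassificationEvaluator(metricName="areaUnderROC").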


Ensemble all of the models!

