Ensembles

Machine Learning with PySpark

Andrew Collier

Data Scientist, Fathom Data

What's an ensemble?

It's a collection of models.

A collection of similar models

Wisdom of the Crowd: the collective opinion of a group is often better than that of a single expert.

Ensemble diversity

Diversity and independence are important because the best collective decisions are the product of disagreement and contest, not consensus or compromise.

― James Surowiecki, The Wisdom of Crowds
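A small piece of arithmetic helps show why independence matters. The sketch below assumes a hypothetical crowd of independent voters that are each correct with the same probability; the value 0.6 and the function name `majority_accuracy` are illustrative and not taken from the course.

```python
from math import comb

def majority_accuracy(n_voters: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent voters is correct."""
    needed = n_voters // 2 + 1  # smallest strict majority
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )

# Compare a single voter with crowds of 5 and 25 (all hypothetical).
for n in (1, 5, 25):
    print(n, round(majority_accuracy(n, 0.6), 3))
```

With five such voters the majority is right about 68% of the time, against 60% for a single voter. The gain depends on the voters' errors being independent, which is what diversity in an ensemble is meant to provide.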

Random Forest

Random Forest: an ensemble of Decision Trees

Creating model diversity:

each tree is trained on a random subset of the data
a random subset of features is used for splitting at each node

No two trees in the forest should be the same.
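The two sources of randomness can be sketched in plain Python. This is only an illustration of the idea, not Spark's implementation: it assumes rows are drawn with replacement (a bootstrap sample) and that two candidate features are drawn per node; the helper names, the row count and the seed are hypothetical.

```python
import random

# Feature names as used for the cars model in this course.
FEATURES = ["cyl", "size", "mass", "length", "rpm", "consumption"]

def sample_rows(n_rows, rng):
    """Bootstrap sample: draw n_rows row indices with replacement (assumption)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def sample_features(features, k, rng):
    """Pick k distinct candidate features for one node's split."""
    return rng.sample(features, k)

rng = random.Random(42)  # fixed seed so the run is repeatable
for tree_id in range(3):
    rows = sample_rows(10, rng)                  # data seen by this tree
    candidates = sample_features(FEATURES, 2, rng)  # features tried at one node
    print(tree_id, sorted(rows), candidates)
```

Each tree ends up with its own view of the rows and its own split candidates, which is why the fitted trees differ. In `RandomForestClassifier` the corresponding behavior is controlled by parameters such as `subsamplingRate` and `featureSubsetStrategy`.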

A collection of trees

Create a forest of trees

Returning to the cars data: manufactured in the USA (0.0) or not (1.0).

Create a Random Forest classifier.

from pyspark.ml.classification import RandomForestClassifier

forest = RandomForestClassifier(numTrees=5)

Fit to the training data.

forest = forest.fit(cars_train)

Seeing the trees

How can you access the trees within the forest?

forest.trees

[DecisionTreeClassificationModel (uid=dtc_aa66702a4ce9) of depth 5 with 17 nodes,
 DecisionTreeClassificationModel (uid=dtc_99f7efedafe9) of depth 5 with 31 nodes,
 DecisionTreeClassificationModel (uid=dtc_9306e4a5fa1d) of depth 5 with 21 nodes,
 DecisionTreeClassificationModel (uid=dtc_d643bd48a8dd) of depth 5 with 23 nodes,
 DecisionTreeClassificationModel (uid=dtc_a2d5abd67969) of depth 5 with 27 nodes]

Each of these can be used to make individual predictions.

Predictions from individual trees

What predictions are generated by each tree?

+------+------+------+------+------+-----+
|tree 0|tree 1|tree 2|tree 3|tree 4|label|
+------+------+------+------+------+-----+
|   0.0|   0.0|   0.0|   0.0|   0.0|  0.0| <- perfect agreement
|   1.0|   1.0|   0.0|   1.0|   0.0|  0.0|
|   0.0|   0.0|   0.0|   1.0|   1.0|  1.0|
|   0.0|   0.0|   0.0|   1.0|   0.0|  0.0|
|   0.0|   1.0|   1.0|   1.0|   0.0|  1.0|
|   1.0|   1.0|   0.0|   1.0|   1.0|  1.0|
|   1.0|   1.0|   1.0|   1.0|   1.0|  1.0| <- perfect agreement
+------+------+------+------+------+-----+

Consensus predictions

Use the .transform() method to generate consensus predictions.

+-----+----------------------------------------+----------+
|label|probability                             |prediction|
+-----+----------------------------------------+----------+
|0.0  |[0.8,0.2]                               |0.0       |
|0.0  |[0.4,0.6]                               |1.0       |
|1.0  |[0.5333333333333333,0.4666666666666666] |0.0       |
|0.0  |[0.7177777777777778,0.28222222222222226]|0.0       |
|1.0  |[0.39396825396825397,0.606031746031746] |1.0       |
|1.0  |[0.17660818713450294,0.823391812865497] |1.0       |
|1.0  |[0.053968253968253964,0.946031746031746]|1.0       |
+-----+----------------------------------------+----------+
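The probability column is not a simple count of tree votes: in the first row all five trees predict 0.0, yet the probability is [0.8, 0.2]. That suggests the forest combines the class probabilities from each tree rather than the hard predictions. As a rough check, the sketch below takes a plain majority vote over the seven rows of per-tree predictions shown earlier and compares it with the consensus column; the helper `majority_vote` is illustrative only.

```python
# Per-tree predictions, labels and consensus predictions copied from the tables above.
tree_votes = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 1.0],
]
labels = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0]
consensus = [0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0]

def majority_vote(votes):
    """Return 1.0 if more than half of the trees vote 1.0, else 0.0."""
    return 1.0 if sum(votes) > len(votes) / 2 else 0.0

majority = [majority_vote(v) for v in tree_votes]
n_correct = sum(m == y for m, y in zip(majority, labels))

print("majority vote:", majority)
print("matches consensus column:", majority == consensus)
print("correct:", n_correct, "of", len(labels))
```

For these seven rows the majority vote agrees with the forest's prediction in every case, and five of the seven predictions match the label. Agreement is not guaranteed in general, since probabilities and hard votes can point in different directions.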

Feature importances

The model uses the features cyl, size, mass, length, rpm and consumption.

Which of these is most or least important?

forest.featureImportances

SparseVector(6, {0: 0.0205, 1: 0.2701, 2: 0.108, 3: 0.1895, 4: 0.2939, 5: 0.1181})

It looks like:

rpm is the most important
cyl is the least important.
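Reading importances off a vector of indices is error-prone, so it can help to pair them with feature names. The sketch below copies the values printed above into a plain dictionary and assumes that the vector indices follow the order in which the features are listed.

```python
# Assumption: index i in the importance vector corresponds to features[i].
features = ["cyl", "size", "mass", "length", "rpm", "consumption"]
importances = {0: 0.0205, 1: 0.2701, 2: 0.108, 3: 0.1895, 4: 0.2939, 5: 0.1181}

# Pair each name with its score and sort from most to least important.
ranked = sorted(
    ((features[i], score) for i, score in importances.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{name:<12}{score:.4f}")
```

With a fitted model, the same values could be obtained with `forest.featureImportances.toArray()` instead of typing them in by hand.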

Gradient-Boosted Trees

Iterative boosting algorithm:

1. Build a Decision Tree and add it to the ensemble.
2. Predict the label for each training instance using the ensemble.
3. Compare the predictions with the known labels.
4. Emphasize training instances with incorrect predictions.
5. Return to step 1.

The model improves on each iteration.
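The loop above can be sketched in plain Python. This is an AdaBoost-style illustration that reweights training instances and uses one-split trees (stumps); it is not how `GBTClassifier` is implemented. The toy data, the labels in {-1, +1} and all function names are hypothetical.

```python
import math

# Hypothetical toy data: one feature, labels in {-1, +1}.
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1, 1, -1, -1, 1, 1]

def stump_predict(stump, x):
    """A stump predicts `sign` below its threshold and `-sign` otherwise."""
    threshold, sign = stump
    return sign if x < threshold else -sign

def fit_stump(X, y, weights):
    """Return the (threshold, sign) stump with the lowest weighted error."""
    best, best_err = None, float("inf")
    for threshold in [x + 0.5 for x in [0.0] + X]:
        for sign in (1, -1):
            err = sum(w for x, t, w in zip(X, y, weights)
                      if stump_predict((threshold, sign), x) != t)
            if err < best_err:
                best, best_err = (threshold, sign), err
    return best, best_err

def ensemble_score(ensemble, x):
    """Weighted sum of the stump predictions."""
    return sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)

def ensemble_predict(ensemble, x):
    return 1 if ensemble_score(ensemble, x) >= 0 else -1

def boost(X, y, n_rounds):
    weights = [1.0 / len(X)] * len(X)
    ensemble = []
    for _ in range(n_rounds):
        # Step 1: build a tree on the weighted data and add it to the ensemble.
        stump, err = fit_stump(X, y, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Steps 2-4: score every instance with the ensemble, compare with the
        # label, and give more weight where the score disagrees with the label.
        weights = [math.exp(-t * ensemble_score(ensemble, x)) for x, t in zip(X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

for rounds in (1, 3):
    model = boost(X, y, rounds)
    errors = sum(ensemble_predict(model, x) != t for x, t in zip(X, y))
    print(rounds, "round(s):", errors, "training errors")
```

On this toy data a single stump misclassifies two of the six points, while three boosting rounds classify all six correctly, because later stumps concentrate on the points that earlier ones got wrong.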

Boosting trees

Create a Gradient-Boosted Tree classifier.

from pyspark.ml.classification import GBTClassifier

gbt = GBTClassifier(maxIter=10)

Fit to the training data.

gbt = gbt.fit(cars_train)

Comparing trees

Let's compare the three types of tree models on the testing data.

# AUC for Decision Tree
0.5875

# AUC for Random Forest
0.65

# AUC for Gradient-Boosted Tree
0.65

Both of the ensemble methods perform better than a plain Decision Tree.
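AUC measures how well a model ranks positive cases above negative ones: 0.5 is no better than chance and 1.0 is a perfect ranking. The sketch below computes it by hand for the seven rows of the Random Forest consensus table shown earlier, using the second probability as the score for class 1.0. Those rows are only a small sample, so the result is not expected to match the 0.65 reported for the full testing data; the function name `auc` is illustrative.

```python
# (label, probability of class 1.0) for the seven rows of the consensus table.
rows = [
    (0.0, 0.2),
    (0.0, 0.6),
    (1.0, 0.4666666666666666),
    (0.0, 0.28222222222222226),
    (1.0, 0.606031746031746),
    (1.0, 0.823391812865497),
    (1.0, 0.946031746031746),
]

def auc(rows):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [score for label, score in rows if label == 1.0]
    neg = [score for label, score in rows if label == 0.0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(round(auc(rows), 4))
```

In PySpark the same metric is available through `BinaryClassificationEvaluator` in `pyspark.ml.evaluation`, whose `metricName` can be set to `areaUnderROC`.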

Ensemble all of the models!
