From workflows to pipelines

Designing Machine Learning Workflows in Python

Dr. Chris Anagnostopoulos

Honorary Associate Professor

Revisiting our workflow

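from sklearn.ensemble import RandomForestClassifier as rf
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
# Tune max_depth by cross-validation on the training data
grid_search = GridSearchCV(rf(), param_grid={'max_depth': [2, 5, 10]})
grid_search.fit(X_train, y_train)
depth = grid_search.best_params_['max_depth']
# Keep the 3 best features, then train the final classifier on them
vt = SelectKBest(f_classif, k=3).fit(X_train, y_train)
clf = rf(max_depth=depth).fit(vt.transform(X_train), y_train)
# Score on held-out data: accuracy_score(y_true, y_pred)
accuracy_score(y_test, clf.predict(vt.transform(X_test)))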

The power of grid search

Optimize max_depth:

pg = {'max_depth': [2,5,10]}
gs = GridSearchCV(rf(),  
   param_grid=pg)
gs.fit(X_train, y_train)
depth = gs.best_params_['max_depth']

Table of all combinations of depth and number of estimators, showing three values explored and one marked as best.


The power of grid search

Then optimize n_estimators:

pg = {'n_estimators': [10,20,30]}
gs = GridSearchCV(
   rf(max_depth=depth),  
   param_grid=pg)
gs.fit(X_train, y_train)
n_est = gs.best_params_[
    'n_estimators']

Table of all combinations of depth and number of estimators, showing five values explored and a different one now marked as best.


The power of grid search

Jointly optimize max_depth and n_estimators:

pg = {
   'max_depth': [2,5,10],
   'n_estimators': [10,20,30]
}
gs = GridSearchCV(rf(),  
   param_grid=pg)
gs.fit(X_train, y_train)
print(gs.best_params_) 

{'max_depth': 10, 'n_estimators': 20}

Table of all combinations of depth and number of estimators, showing all values explored and the same best result.
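The table in the figure can be rebuilt from a fitted search. The following is a minimal sketch, assuming synthetic data from `make_classification` in place of the course dataset, 3-fold cross-validation, and a fixed `random_state`; none of these appear on the slides.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier as rf
from sklearn.model_selection import GridSearchCV

# Synthetic data is an assumption; the slides do not show the dataset.
X_train, y_train = make_classification(n_samples=200, n_features=8, random_state=0)

pg = {'max_depth': [2, 5, 10], 'n_estimators': [10, 20, 30]}
gs = GridSearchCV(rf(random_state=0), param_grid=pg, cv=3)
gs.fit(X_train, y_train)

# Map each (max_depth, n_estimators) pair to its mean cross-validated score.
scores = {
    (p['max_depth'], p['n_estimators']): s
    for p, s in zip(gs.cv_results_['params'], gs.cv_results_['mean_test_score'])
}

# Print the grid: one row per max_depth, one column per n_estimators.
for depth in pg['max_depth']:
    row = [f"{scores[(depth, n)]:.3f}" for n in pg['n_estimators']]
    print(f"max_depth={depth}:", row)

print('best:', gs.best_params_)
```

Tuning one hyperparameter at a time only visits one row and one column of this grid, whereas the joint search scores every cell.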


Pipeline

In this diagram, a random forest with two hyperparameters is connected by an arrow to a feature selector with one hyperparameter.


Pipeline

Both objects are wrapped in a single box.


Pipeline

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('feature_selection', SelectKBest(f_classif)),
    ('classifier', RandomForestClassifier())
])

# Grid keys follow the pattern <step name>__<parameter name>
params = dict(
    feature_selection__k=[2, 3, 4],
    classifier__max_depth=[5, 10, 20]
)
grid_search = GridSearchCV(pipe, param_grid=params)
gs = grid_search.fit(X_train, y_train).best_params_
print(gs)

{'classifier__max_depth': 20, 'feature_selection__k': 4}
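The double-underscore names in the grid must match parameters that the pipeline actually exposes. Here is a minimal sketch of how to check that and how to reuse the tuned pipeline, assuming synthetic data, 3-fold cross-validation, and `n_estimators=20` chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic data is an assumption; the slides do not show the dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ('feature_selection', SelectKBest(f_classif)),
    ('classifier', RandomForestClassifier(n_estimators=20, random_state=0)),
])

# get_params() lists every tunable name; nested ones use step__parameter.
valid_names = pipe.get_params().keys()
params = dict(feature_selection__k=[2, 3, 4], classifier__max_depth=[5, 10, 20])
unknown = [name for name in params if name not in valid_names]
print('unknown grid keys:', unknown)

grid_search = GridSearchCV(pipe, param_grid=params, cv=3).fit(X_train, y_train)
print(grid_search.best_params_)

# best_estimator_ is the whole pipeline, refit with the best settings,
# so feature selection is applied automatically before prediction.
test_accuracy = grid_search.best_estimator_.score(X_test, y_test)
print(test_accuracy)
```

A key with a step prefix that does not exist in the pipeline would show up in `unknown` here, before any fitting is attempted.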

Customizing your pipeline

from sklearn.metrics import roc_auc_score, make_scorer
auc_scorer = make_scorer(roc_auc_score)

grid_search = GridSearchCV(pipe, param_grid=params, scoring=auc_scorer)
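One detail worth knowing: `make_scorer(roc_auc_score)` with default arguments scores the hard class labels returned by `predict`, while scikit-learn's built-in `'roc_auc'` scoring string uses continuous scores. The sketch below runs both side by side, assuming synthetic binary data, 3-fold cross-validation, and a reduced grid for speed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic binary data is an assumption; the slides do not show the dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline([
    ('feature_selection', SelectKBest(f_classif)),
    ('classifier', RandomForestClassifier(n_estimators=20, random_state=0)),
])
params = dict(feature_selection__k=[2, 3, 4], classifier__max_depth=[5, 10])

# As on the slide: AUC computed from hard 0/1 predictions.
auc_scorer = make_scorer(roc_auc_score)
gs_labels = GridSearchCV(pipe, params, scoring=auc_scorer, cv=3).fit(X, y)

# Built-in scorer: AUC computed from continuous scores
# (predicted probabilities for a random forest).
gs_proba = GridSearchCV(pipe, params, scoring='roc_auc', cv=3).fit(X, y)

print('label-based AUC:', gs_labels.best_score_, gs_labels.best_params_)
print('score-based AUC:', gs_proba.best_score_, gs_proba.best_params_)
```

Either form plugs into `GridSearchCV` the same way; which one is appropriate depends on whether the ranking quality of the scores or the quality of the final labels matters for the task.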

Don't overdo it

# Prefixes must match the pipeline step names ('classifier', not 'clf')
params = dict(
    feature_selection__k=[2, 3, 4],
    classifier__max_depth=[5, 10, 20],
    classifier__n_estimators=[10, 20, 30]
)
grid_search = GridSearchCV(pipe, params, cv=10)

3 x 3 x 3 x 10 = 270 classifier fits!
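The arithmetic above can be checked before launching a search. This is a small sketch using `ParameterGrid` to count candidates; the variable names `cv`, `n_candidates`, and `n_fits` are illustrative only.

```python
from sklearn.model_selection import ParameterGrid

params = dict(
    feature_selection__k=[2, 3, 4],
    classifier__max_depth=[5, 10, 20],
    classifier__n_estimators=[10, 20, 30],
)
cv = 10  # number of cross-validation folds

# ParameterGrid enumerates every combination without fitting anything.
n_candidates = len(ParameterGrid(params))
n_fits = n_candidates * cv
print(n_candidates, 'candidates x', cv, 'folds =', n_fits, 'fits')
```

With the default `refit=True`, `GridSearchCV` also performs one extra fit of the best candidate on the full training set after the search.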


Supercharged workflows
