Machine Learning with Tree-Based Models in Python
Elie Kawerk
Data Scientist
Machine learning model:
parameters: learned from data
hyperparameters: not learned from data, set before training
e.g. max_depth, min_samples_leaf, splitting criterion ...
Problem: search for a set of optimal hyperparameters for a learning algorithm.
Solution: find the set of hyperparameters that yields an optimal model.
Optimal model: the one that achieves the best score.
Score: in sklearn, defaults to accuracy (classification) and $R^2$ (regression).
Cross-validation is used to estimate generalization performance.
In sklearn, a model's default hyperparameters are not optimal for every problem.
Hyperparameters should be tuned to obtain the best performance.
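To make the parameter/hyperparameter distinction concrete, here is a minimal sketch. It assumes scikit-learn's bundled breast cancer dataset, which is an illustrative choice and not a dataset named in these slides. Hyperparameters can be read before any training; what the tree learns only exists after fit; and leaving scoring at its default makes cross_val_score fall back to the classifier's own score method, which is accuracy.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: an illustrative dataset bundled with scikit-learn
X, y = load_breast_cancer(return_X_y=True)

dt = DecisionTreeClassifier(max_depth=3, random_state=1)

# Hyperparameters: set before training, readable without fitting
hyperparams = dt.get_params()
print(hyperparams['max_depth'], hyperparams['min_samples_leaf'])

# Parameters: learned from the data, only available after fitting
dt.fit(X, y)
n_nodes = dt.tree_.node_count      # number of nodes the tree actually grew
learned_depth = dt.get_depth()     # depth reached, bounded by max_depth
print(n_nodes, learned_depth)

# With scoring left at its default, cross_val_score uses the estimator's
# own score method, which is accuracy for classifiers
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=1), X, y, cv=5
)
mean_cv_accuracy = cv_scores.mean()
print(round(mean_cv_accuracy, 3))
```

The mean of the fold scores is the estimate of generalization performance that hyperparameter tuning tries to maximize.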
Grid Search
Random Search
Bayesian Optimization
Genetic Algorithms
....
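For contrast with grid search, the following is a rough sketch of the random search approach using scikit-learn's RandomizedSearchCV. The dataset and the sampling ranges are assumptions picked for illustration only; instead of trying every combination, the search draws n_iter candidates from the given distributions.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Assumption: an illustrative dataset bundled with scikit-learn
X, y = load_breast_cancer(return_X_y=True)

# Hypothetical search ranges chosen for illustration
param_distributions = {
    'max_depth': randint(2, 7),               # integers 2 to 6
    'min_samples_leaf': uniform(0.02, 0.08),  # floats in [0.02, 0.10]
}

random_dt = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_distributions=param_distributions,
    n_iter=10,            # number of sampled candidates
    scoring='accuracy',
    cv=5,
    random_state=1,
)
random_dt.fit(X, y)

best_random_params = random_dt.best_params_
print(best_random_params)
```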
Manually set a grid of discrete hyperparameter values.
Set a metric for scoring model performance.
Search exhaustively through the grid.
For each set of hyperparameters, evaluate the model's CV score.
The optimal hyperparameters are those of the model with the best CV score.
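The steps above can be sketched by hand before delegating them to a library. This is a minimal illustration, assuming scikit-learn's bundled breast cancer dataset and 5-fold CV (both are choices made here, not taken from the slides), over a small grid of three max_depth values and two min_samples_leaf values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: an illustrative dataset bundled with scikit-learn
X, y = load_breast_cancer(return_X_y=True)

# Step 1: a grid of discrete values; 3 x 2 = 6 combinations
grid = ParameterGrid({'max_depth': [2, 3, 4], 'min_samples_leaf': [0.05, 0.1]})

best_score, best_params = -1.0, None
# Step 3: visit every combination in the grid
for params in grid:
    model = DecisionTreeClassifier(random_state=1, **params)
    # Steps 2 and 4: score each candidate by its mean CV accuracy
    score = cross_val_score(model, X, y, scoring='accuracy', cv=5).mean()
    # Step 5: keep the candidate with the best CV score
    if score > best_score:
        best_score, best_params = score, params

print(len(grid), best_params, round(best_score, 3))
```

The cost grows multiplicatively: every extra value for one hyperparameter adds a full set of CV fits for each combination of the others.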
max_depth = {2, 3, 4}, min_samples_leaf = {0.05, 0.1}
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
# Set seed to 1 for reproducibility
SEED = 1
# Instantiate a DecisionTreeClassifier 'dt'
dt = DecisionTreeClassifier(random_state=SEED)
# Print out 'dt's hyperparameters
print(dt.get_params())
{'class_weight': None,
'criterion': 'gini',
'max_depth': None,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_impurity_split': None,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'presort': False,
'random_state': 1,
'splitter': 'best'}
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV
# Define the grid of hyperparameters 'params_dt'
params_dt = {
    'max_depth': [3, 4, 5, 6],
    'min_samples_leaf': [0.04, 0.06, 0.08],
    'max_features': [0.2, 0.4, 0.6, 0.8]
}
# Instantiate a 10-fold CV grid search object 'grid_dt'
grid_dt = GridSearchCV(estimator=dt,
                       param_grid=params_dt,
                       scoring='accuracy',
                       cv=10,
                       n_jobs=-1)
# Fit 'grid_dt' to the training data
grid_dt.fit(X_train, y_train)
# Extract best hyperparameters from 'grid_dt'
best_hyperparams = grid_dt.best_params_
print('Best hyperparameters:\n', best_hyperparams)
Best hyperparameters:
{'max_depth': 3, 'max_features': 0.4, 'min_samples_leaf': 0.06}
# Extract best CV score from 'grid_dt'
best_CV_score = grid_dt.best_score_
print('Best CV accuracy: {:.3f}'.format(best_CV_score))
Best CV accuracy: 0.938
# Extract best model from 'grid_dt'
best_model = grid_dt.best_estimator_
# Evaluate test set accuracy
test_acc = best_model.score(X_test, y_test)
# Print test set accuracy
print("Test set accuracy of best model: {:.3f}".format(test_acc))
Test set accuracy of best model: 0.947
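The snippets above rely on X_train, y_train, X_test and y_test being defined elsewhere. As a self-contained sketch of the same workflow, the version below assumes scikit-learn's bundled breast cancer dataset and a stratified 80/20 split; both are illustrative choices, so the printed numbers need not match the ones shown above. n_jobs is left at its default here for portability.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 1

# Assumptions: illustrative dataset and an 80/20 stratified split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED
)

# Same grid as above: 4 x 3 x 4 = 48 candidate settings
params_dt = {
    'max_depth': [3, 4, 5, 6],
    'min_samples_leaf': [0.04, 0.06, 0.08],
    'max_features': [0.2, 0.4, 0.6, 0.8],
}

grid_dt = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=SEED),
    param_grid=params_dt,
    scoring='accuracy',
    cv=10,
)
grid_dt.fit(X_train, y_train)

# One entry per candidate setting that was cross-validated
n_candidates = len(grid_dt.cv_results_['params'])

# With the default refit=True, best_estimator_ has been retrained
# on the whole training set using the best hyperparameters
best_model = grid_dt.best_estimator_
test_acc = best_model.score(X_test, y_test)

print('Candidates evaluated:', n_candidates)
print('Best hyperparameters:', grid_dt.best_params_)
print('Best CV accuracy: {:.3f}'.format(grid_dt.best_score_))
print('Test set accuracy of best model: {:.3f}'.format(test_acc))
```

The test set is used only once, after the search has finished, so the test accuracy remains an untouched estimate of how the tuned model generalizes.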