Automatic machine learning with H2O

Hyperparameter Tuning in R

Dr. Shirin Elsinghorst

Senior Data Scientist

Automatic Machine Learning (AutoML)

Automatic tuning of algorithms, in addition to hyperparameters

AutoML makes model tuning and optimization much faster and easier

AutoML only needs a dataset, a target variable, and a limit on either training time or the number of models
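The three inputs named above can be sketched as follows. This is a minimal sketch, not the course's own setup: it assumes the built-in `iris` data as a stand-in dataset, `Species` as the target, and `max_models = 5` as the limit. The names `data_hf`, `splits`, `test` and `automl_small` are illustrative.

```r
library(h2o)
h2o.init()

# Assumption: iris stands in for the course dataset
data_hf <- as.h2o(iris)

# Target variable and feature names
y <- "Species"
x <- setdiff(colnames(data_hf), y)

# Split into training, validation and test frames (70/15/15)
splits <- h2o.splitFrame(data_hf, ratios = c(0.7, 0.15), seed = 42)
train <- splits[[1]]
valid <- splits[[2]]
test  <- splits[[3]]

# Limit the run by number of models instead of by runtime
automl_small <- h2o.automl(x = x, y = y,
                           training_frame = train,
                           max_models = 5,
                           seed = 42)
```

A model-count limit tends to be easier to reproduce than a time limit, because the number of models trained within a fixed time depends on the hardware.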

AutoML in H2O

AutoML compares the following algorithms:

Generalized Linear Model (GLM)
(Distributed) Random Forest (DRF)
Extremely Randomized Trees (XRT)
Extreme Gradient Boosting (XGBoost)
Gradient Boosting Machines (GBM)
Deep Learning (fully-connected multi-layer artificial neural network)
Stacked Ensembles (of all models & of best of family)

GBM Hyperparameters

histogram_type
ntrees
max_depth
min_rows
learn_rate
sample_rate
col_sample_rate
col_sample_rate_per_tree
min_split_improvement

Deep Learning Hyperparameters

epochs
adaptive_rate
activation
rho
epsilon
input_dropout_ratio
hidden
hidden_dropout_ratios

# Using h2o.automl function
automl_model <- h2o.automl(x = x, y = y,
                           training_frame = train,
                           validation_frame = valid,
                           max_runtime_secs = 60,
                           sort_metric = "logloss",
                           seed = 42)

h2o.automl() returns an H2OAutoML object containing a leaderboard of all models, ranked by the chosen metric (here "logloss"), and the best model in its "leader" slot:

Slot "leader":
Model Details:
==============

H2OMultinomialModel: gbm
Model Summary: 
 number_of_trees number_of_internal_trees model_size_in_bytes min_depth
             189                      567               65728         1
 max_depth mean_depth min_leaves max_leaves mean_leaves
         5    2.96649          2          6     4.20988

Viewing the AutoML leaderboard

lb <- automl_model@leaderboard

                                    model_id mean_per_class_error
1  GBM_grid_0_AutoML_20181029_144443_model_6           0.01851852
2 GBM_grid_0_AutoML_20181029_144443_model_30           0.02777778
3 GBM_grid_0_AutoML_20181029_144443_model_18           0.02777778
4  GBM_grid_0_AutoML_20181029_144443_model_9           0.03703704

By default, the leaderboard is calculated from 5-fold cross-validation metrics.

https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
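If the leaderboard should be ranked on held-out data instead of cross-validation metrics, `h2o.automl()` accepts a `leaderboard_frame` argument. The sketch below assumes `iris` as stand-in data and an 80/20 split; the names `automl_test_lb` and `lb_test` are illustrative and not from the slides.

```r
library(h2o)
h2o.init()

# Assumption: iris stands in for the course dataset
data_hf <- as.h2o(iris)
y <- "Species"
x <- setdiff(colnames(data_hf), y)

# Hold out 20% of the rows for scoring the leaderboard
splits <- h2o.splitFrame(data_hf, ratios = 0.8, seed = 42)
train <- splits[[1]]
test  <- splits[[2]]

automl_test_lb <- h2o.automl(x = x, y = y,
                             training_frame = train,
                             leaderboard_frame = test,  # rank models on this frame
                             nfolds = 5,                # cross-validation is still used during training
                             max_models = 5,
                             sort_metric = "logloss",
                             seed = 42)

# Print the full leaderboard rather than only the first rows
lb_test <- automl_test_lb@leaderboard
print(lb_test, n = nrow(lb_test))
```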

Extracting models from AutoML leaderboard

# List all models by model id
model_ids <- as.data.frame(lb)$model_id

 [1] "GBM_grid_0_AutoML_20181029_144443_model_6"       
 [3] "GBM_grid_0_AutoML_20181029_144443_model_18"         
[19] "XRT_0_AutoML_20181029_144443"
[20] "DRF_0_AutoML_20181029_144443"            
[24] "DeepLearning_0_AutoML_20181029_144443"                        
[41] "StackedEnsemble_BestOfFamily_0_AutoML_20181029_144443" 
[42] "StackedEnsemble_AllModels_0_AutoML_20181029_144443"
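Because the leaderboard is sorted, the first id of each algorithm family is that family's best model. A small base-R sketch, using only the ids printed above and assuming that the family name is the text before the first underscore:

```r
# Model ids copied from the output above (only the entries shown there)
model_ids <- c(
  "GBM_grid_0_AutoML_20181029_144443_model_6",
  "GBM_grid_0_AutoML_20181029_144443_model_18",
  "XRT_0_AutoML_20181029_144443",
  "DRF_0_AutoML_20181029_144443",
  "DeepLearning_0_AutoML_20181029_144443",
  "StackedEnsemble_BestOfFamily_0_AutoML_20181029_144443",
  "StackedEnsemble_AllModels_0_AutoML_20181029_144443"
)

# Assumption: the algorithm family is everything before the first underscore
family <- sub("_.*$", "", model_ids)

# Keep the first (best-ranked) id of each family
first_per_family <- model_ids[!duplicated(family)]
print(first_per_family)

# With the H2O cluster that trained these models still running,
# a single model could then be retrieved by its id:
# best_gbm <- h2o.getModel(first_per_family[1])
```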

# Get the best model
aml_leader <- automl_model@leader

aml_leader is again a regular H2O model object and can be treated as such!
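Treating the leader like any other H2O model could look like the sketch below. It assumes `iris` as stand-in data, a held-out frame `test`, and `max_models = 5` to keep the run short; none of these come from the slides.

```r
library(h2o)
h2o.init()

# Assumption: iris stands in for the course dataset
data_hf <- as.h2o(iris)
y <- "Species"
x <- setdiff(colnames(data_hf), y)

splits <- h2o.splitFrame(data_hf, ratios = 0.8, seed = 42)
train <- splits[[1]]
test  <- splits[[2]]

automl_model <- h2o.automl(x = x, y = y,
                           training_frame = train,
                           max_models = 5,
                           sort_metric = "logloss",
                           seed = 42)

# The leader is a regular H2O model object
aml_leader <- automl_model@leader

# Evaluate it on held-out data
perf <- h2o.performance(aml_leader, newdata = test)
leader_logloss <- h2o.logloss(perf)
print(leader_logloss)

# Predict classes and class probabilities for new data
pred <- h2o.predict(aml_leader, newdata = test)
head(pred)
```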

Get ready for your last round of exercises!
