Hyperparameter Tuning with DVC

CI/CD for Machine Learning

Ravi Bhadauria

Machine Learning Engineer

Hyperparameter tuning workflow

  • Hyperparameter tuning

    • Input: Parameter sweep ranges
    • Output: Best parameters
  • Training

    • Input: Best parameters
    • Output: Metrics and plots (already covered)
  • Loose coupling for independent training

    • Hyperparameter tuning is sufficient but not necessary: training only needs a best-parameters file, however it was produced
  • Both jobs are dataset dependent

# Contents of hp configuration
{
    "n_estimators": [2, 4, 5],
    "max_depth": [10, 20, 50],
    "random_state": [1993]
}
# Contents of best parameters
{
    "n_estimators": 5,
    "max_depth": 20,
    "random_state": 1993
}
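
To see how many candidates a sweep like the one above produces, the configuration can be expanded by hand. This is a minimal sketch, assuming the config is a flat mapping from parameter name to a list of values; `expand_grid` is a hypothetical helper written for illustration and is not part of scikit-learn or DVC.

```python
import itertools
import json

# Same content as the hp configuration above, inlined so the sketch is self-contained.
hp_config = json.loads("""
{
    "n_estimators": [2, 4, 5],
    "max_depth": [10, 20, 50],
    "random_state": [1993]
}
""")


def expand_grid(config):
    """Return one dict per combination of parameter values (hypothetical helper)."""
    names = list(config)
    value_lists = [config[name] for name in names]
    return [dict(zip(names, values)) for values in itertools.product(*value_lists)]


candidates = expand_grid(hp_config)
print(len(candidates))  # 3 * 3 * 1 combinations
print(candidates[0])
```

Each candidate has the same shape as the best-parameters file, which is why the tuning output can be fed to training unchanged.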

Training code changes

Changes in Python training script
# Load hyperparameters from the JSON file
with open("rfc_best_params.json", "r") as params_file:
  rfc_params = json.load(params_file)

# Define and train model
model = RandomForestClassifier(**rfc_params)
model.fit(X_train, y_train)
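
Because the best-parameters file can also be edited by hand, it may be worth checking it before training. The sketch below is one possible guard, not part of the course code: `load_best_params` is a hypothetical helper that rejects sweep-style list values, which would indicate the search configuration was supplied by mistake. A temporary file stands in for `rfc_best_params.json`.

```python
import json
import os
import tempfile


def load_best_params(path):
    """Load training parameters, rejecting list or dict values (hypothetical helper)."""
    with open(path, "r") as params_file:
        params = json.load(params_file)
    for name, value in params.items():
        if isinstance(value, (list, dict)):
            raise ValueError(f"{name!r} should be a single value, got {value!r}")
    return params


# Demo: write a stand-in for rfc_best_params.json, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rfc_best_params.json")
    with open(path, "w") as outfile:
        json.dump({"n_estimators": 5, "max_depth": 20, "random_state": 1993}, outfile)
    rfc_params = load_best_params(path)

print(rfc_params)
```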

Hyperparameter Tuning with GridSearch

# Define the model and hyperparameter search space
model = RandomForestClassifier()
with open("hp_config.json", "r") as config_file:
  param_grid = json.load(config_file)

# Perform GridSearch with five-fold CV
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
with open("rfc_best_params.json", "w") as outfile:
  json.dump(best_params, outfile)

DVC YAML changes

Hyperparameter Tuning
stages:
  preprocess: ...
  train: ...
  hp_tune:
    cmd: python hp_tuning.py
    deps:
    - processed_dataset/weather.csv
    - hp_config.json
    - hp_tuning.py
    outs: # Not tracking best parameters
      - hp_tuning_results.md:
          cache: false
Training
stages:
  preprocess: ...
  hp_tune: ...
  train:
    cmd: python train.py
    deps:
    - processed_dataset/weather.csv
    - rfc_best_params.json # Best parameters
    - train.py
    metrics:
      - metrics.json:
          cache: false

Triggering individual stages

  • Stages can be triggered independently: dvc repro <stage_name>

  • Force-run the hyperparameter tuning stage: dvc repro -f hp_tune

    • Ensures the best-parameters file is regenerated
  • Training can be run with: dvc repro train

  • Both stages trigger the preprocessing step as a dependency
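
The dependency behaviour described above can be illustrated with a toy model. This is a simplified sketch and not how DVC is implemented: it only follows file names from a stage's inputs to whichever stage writes them. The `preprocess` entry is an assumption, since its definition is elided in the slides, and `upstream_stages` is a hypothetical helper.

```python
# Toy model of the pipeline: each stage lists the files it reads and writes.
# The "preprocess" entry is assumed; its real definition is elided in the slides.
stages = {
    "preprocess": {"reads": [], "writes": ["processed_dataset/weather.csv"]},
    "hp_tune": {
        "reads": ["processed_dataset/weather.csv", "hp_config.json", "hp_tuning.py"],
        "writes": ["hp_tuning_results.md"],
    },
    "train": {
        "reads": ["processed_dataset/weather.csv", "rfc_best_params.json", "train.py"],
        "writes": ["metrics.json"],
    },
}


def upstream_stages(target, stages):
    """Return the target and every stage that writes one of its inputs, upstream first."""
    producers = {path: name for name, spec in stages.items() for path in spec["writes"]}
    ordered = []

    def visit(name):
        for dep in stages[name]["reads"]:
            producer = producers.get(dep)
            if producer is not None and producer not in ordered:
                visit(producer)
        if name not in ordered:
            ordered.append(name)

    visit(target)
    return ordered


train_plan = upstream_stages("train", stages)
hp_tune_plan = upstream_stages("hp_tune", stages)
print(train_plan)
print(hp_tune_plan)
```

In this model `train` never pulls in `hp_tune`, because `rfc_best_params.json` is not listed as an output of any stage. That is the loose coupling the workflow relies on.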

Hyperparameter Run Output

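| mean_test_score | std_test_score | max_depth | n_estimators | random_state |
|-----------------|----------------|-----------|--------------|--------------|
| 0.999733        | 0.000413118    | 20        | 5            | 1993         |
| 0.999307        | 0.000574418    | 50        | 5            | 1993         |
| 0.99888         | 0.000617378    | 10        | 5            | 1993         |
| 0.997813        | 0.00117333     | 10        | 4            | 1993         |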

Changes in Python hyperparameter tuning script

# Save the results of hyperparameter tuning
cv_results = pd.DataFrame(grid_search.cv_results_)
markdown_table = cv_results.to_markdown(index=False)
with open("hp_tuning_results.md", "w") as markdown_file:
  markdown_file.write(markdown_table)
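
For a compact report limited to scores and parameters, sorted best first, the table can also be built without pandas. The sketch below is a dependency-free alternative, not the course's method: `results_to_markdown` is a hypothetical helper, and the input dict only mimics the `mean_test_score`, `std_test_score`, and `params` entries of `grid_search.cv_results_`, using the four rows from the run output.

```python
# Stand-in for grid_search.cv_results_, using the rows from the run output (unsorted).
cv_results = {
    "mean_test_score": [0.997813, 0.99888, 0.999733, 0.999307],
    "std_test_score": [0.00117333, 0.000617378, 0.000413118, 0.000574418],
    "params": [
        {"max_depth": 10, "n_estimators": 4, "random_state": 1993},
        {"max_depth": 10, "n_estimators": 5, "random_state": 1993},
        {"max_depth": 20, "n_estimators": 5, "random_state": 1993},
        {"max_depth": 50, "n_estimators": 5, "random_state": 1993},
    ],
}


def results_to_markdown(results):
    """Render scores and parameters as a Markdown table, best score first (hypothetical helper)."""
    param_names = sorted(results["params"][0])
    header = ["mean_test_score", "std_test_score"] + param_names
    rows = [
        [mean, std] + [params[name] for name in param_names]
        for mean, std, params in zip(
            results["mean_test_score"], results["std_test_score"], results["params"]
        )
    ]
    rows.sort(key=lambda row: row[0], reverse=True)
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(str(value) for value in row) + " |" for row in rows]
    return "\n".join(lines)


markdown_table = results_to_markdown(cv_results)
print(markdown_table)
```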

Summary

  • Hyperparameter tuning route

    • Branch name hp_tune/<some-string>
    • Make changes to search configuration
    • Manually open a PR
      • Force-runs the DVC pipeline: dvc repro -f hp_tune
      • Uses cml pr create to open a new training PR with the best parameters
    • Force push a commit to training PR to kick off model training job
  • Manual route

    • Branch name train/<some-string>
    • Edit best parameters file and commit changes
    • Manually open a PR to kick off model training job
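
The branch naming convention above amounts to a small routing rule. The sketch below only illustrates that rule; in practice the routing would live in the CI workflow configuration. `job_for_branch` and the example branch names are hypothetical.

```python
def job_for_branch(branch_name):
    """Map a branch name to the DVC command its PR should run (hypothetical helper)."""
    if branch_name.startswith("hp_tune/"):
        return "dvc repro -f hp_tune"
    if branch_name.startswith("train/"):
        return "dvc repro train"
    return None  # Any other branch starts neither job.


# Example branch names are made up for illustration.
print(job_for_branch("hp_tune/wider-depth-range"))
print(job_for_branch("train/manual-params"))
print(job_for_branch("docs/readme"))
```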

Let's practice!
