Data and model versioning

Developing Machine Learning Models for Production

Sinan Ozdemir

Data Scientist, Entrepreneur, and Author

Major & minor versioning

Two types of versioning: major and minor

  • Major versioning indicates a big change in data or model

  • Minor versioning indicates a small change

With major/minor versions, we can see what changed in the data or model and how large each change was
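
The major/minor rule above can be sketched in a few lines of Python. This is only an illustration: the `Version` class and its `bump_minor` / `bump_major` methods are hypothetical names, not part of any library mentioned in this course.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """A major/minor version label such as 1.2 (hypothetical helper)."""
    major: int
    minor: int

    def bump_minor(self) -> "Version":
        # Small change: keep the major number, increment the minor number.
        return Version(self.major, self.minor + 1)

    def bump_major(self) -> "Version":
        # Big change: increment the major number and reset the minor number.
        return Version(self.major + 1, 0)

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}"


v = Version(1, 0)
v = v.bump_minor()      # small change
v = v.bump_minor()      # another small change
print(v)                # 1.2
print(v.bump_major())   # 2.0
```

Storing the two numbers separately, rather than as one string, means versions compare in the expected order (for example, 1.2 sorts before 2.0).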


Versioning training data

  • Unique major/minor labels or timestamps
  • Ensuring reproducibility and traceability of experiments
  • Easily going back to a previous version of data
  • Seeing how data has changed over time

Example:

  1. Data V 1.0 was our initial dataset
  2. Data V 1.1 included additional feature transformations
  3. Data V 1.2 added a feature selection method
  4. Data V 2.0 includes a brand new source of data
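
One possible way to make data versions traceable is to record a content fingerprint and a timestamp next to each label. The sketch below assumes a tiny in-memory dataset; the rows, the `registry` dictionary, and the helper names `fingerprint` and `register_version` are all made up for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(rows):
    """Return a SHA-256 hash of the rows, so identical data gives an identical hash."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def register_version(registry, label, rows, note):
    """Store metadata for one data version under its major/minor label."""
    registry[label] = {
        "sha256": fingerprint(rows),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "n_rows": len(rows),
        "note": note,
    }
    return registry[label]


registry = {}

# Toy rows standing in for a real training set (illustrative values only).
raw = [{"age": 31, "income": 52000}, {"age": 45, "income": 61000}]
register_version(registry, "1.0", raw, "initial dataset")

# A minor change: one extra transformed feature.
transformed = [{**row, "income_k": row["income"] / 1000} for row in raw]
register_version(registry, "1.1", transformed, "additional feature transformations")

for label, meta in registry.items():
    print(label, meta["sha256"][:12], meta["note"])
```

Because the hash depends only on the content, re-running an experiment can check that it is reading exactly the data that a given version label refers to.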

Feature stores

  • Central repository for storing and managing different feature versions
  • Easily tracking different versions of features
  • Using features for different experiments
  • Improving collaboration and experimentation integrity
  • Reducing duplication of work
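
To make the idea concrete, here is a toy in-memory feature store. It is a sketch under simplifying assumptions, not how any production feature store is implemented; the `FeatureStore` class, its methods, and the feature values are hypothetical.

```python
class FeatureStore:
    """Toy in-memory feature store (illustrative only)."""

    def __init__(self):
        # Maps feature name -> {version label -> list of values}.
        self._features = {}

    def put(self, name, version, values):
        """Save one version of a feature; existing versions are never overwritten."""
        versions = self._features.setdefault(name, {})
        if version in versions:
            raise ValueError(f"{name} v{version} already exists")
        versions[version] = list(values)

    def get(self, name, version):
        """Return a copy of the requested feature version."""
        return list(self._features[name][version])

    def versions(self, name):
        """List the version labels of a feature in the order they were added."""
        return list(self._features[name])


store = FeatureStore()
store.put("income_k", "1.0", [52.0, 61.0])
store.put("income_k", "1.1", [0.52, 0.61])  # rescaled version of the same feature

# Two experiments read the same feature version instead of recomputing it.
experiment_a = store.get("income_k", "1.1")
experiment_b = store.get("income_k", "1.1")
print(store.versions("income_k"))
print(experiment_a == experiment_b)
```

Refusing to overwrite an existing version is the design choice that protects experiment integrity: a label always refers to the same values.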

Versioning ML models

  • Keeping track of different versions of ML models
  • Ensuring reproducibility and traceability of experiments
  • Easily rolling back to a previous version of a model
  • Model version usually matches the training data version, but not always

Example:

  1. Model V 1.0 is our initial model (Random Forest)
  2. Model V 1.1 is the same model fine-tuned on Data V 1.1
  3. Model V 2.0 is now an XGBoost model fine-tuned on Data V 1.2
  4. Model V 2.1 is the XGBoost model fine-tuned on Data V 2.0
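
The example above can be written down as a small lineage table that links each model version to the data version it was trained on. The structure and the helper `data_version_for` are hypothetical; the data version for Model V 1.0 is an assumption, since the example does not state it.

```python
# Each entry links a model version to its algorithm and training data version.
lineage = [
    # Data version for Model V 1.0 is assumed; the example does not state it.
    {"model": "1.0", "algorithm": "Random Forest", "data": "1.0"},
    {"model": "1.1", "algorithm": "Random Forest", "data": "1.1"},
    {"model": "2.0", "algorithm": "XGBoost", "data": "1.2"},
    {"model": "2.1", "algorithm": "XGBoost", "data": "2.0"},
]


def data_version_for(model_version):
    """Look up which data version a given model version was trained on."""
    for entry in lineage:
        if entry["model"] == model_version:
            return entry["data"]
    raise KeyError(model_version)


# Model and data labels usually match, but not always.
mismatched = [e["model"] for e in lineage if e["model"] != e["data"]]
print(mismatched)               # ['2.0', '2.1']
print(data_version_for("2.0"))  # 1.2
```

Here the model's major version changed because the algorithm changed, while the data only had a minor update, which is why the two labels drift apart.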

Model stores

  • Central repository for storing and managing different model versions
  • Easily tracking different versions of models
  • Rolling back to previous versions
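
A minimal sketch of what rollback in a model store could look like is shown below. It is a toy, in-memory illustration: the `ModelStore` class and its `register`, `promote`, and `rollback` methods are hypothetical, and plain strings stand in for real model objects.

```python
class ModelStore:
    """Toy in-memory model store with rollback (illustrative only)."""

    def __init__(self):
        self._models = {}    # version label -> model object
        self._history = []   # promoted versions in order; the last one is current

    def register(self, version, model):
        """Save a model under a new version label."""
        if version in self._models:
            raise ValueError(f"model v{version} already exists")
        self._models[version] = model

    def promote(self, version):
        """Make a registered version the current one."""
        if version not in self._models:
            raise KeyError(version)
        self._history.append(version)

    def rollback(self):
        """Return to the previously promoted version and report its label."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def current_version(self):
        return self._history[-1]

    @property
    def current_model(self):
        return self._models[self._history[-1]]


store = ModelStore()
store.register("1.0", "random-forest-model")  # strings stand in for real models
store.register("2.0", "xgboost-model")

store.promote("1.0")
store.promote("2.0")
print(store.current_version)  # 2.0

store.rollback()
print(store.current_version)  # 1.0
```

Rolling back only moves the "current" pointer; the newer model stays in the store, so it can be promoted again later.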

Example of model versioning with MLflow

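import mlflow
import mlflow.sklearn

# Start a new MLflow run
with mlflow.start_run() as run:
    # Log the model version as a parameter
    mlflow.log_param("model_version", "1.0")

    # Train the model (train_model() is assumed to be defined elsewhere
    # and to return a fitted scikit-learn estimator)
    model = train_model()

    # Log the trained model as an artifact of this run
    mlflow.sklearn.log_model(model, "model")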

Let's practice!
