Writing effective ML documentation

Developing Machine Learning Models for Production

Sinan Ozdemir

Data Scientist, Entrepreneur, and Author

The components of excellent ML documentation

  • Data sources
  • Data schemas
  • Labeling methods
  • Model experimentation + selection
  • Training environments
  • Model pseudocode

Documenting data sources

Documenting our data sources allows us to establish processes for evaluating the quality of our data.

This also offers other benefits:

  • Keep track of where data comes from.
  • Evaluate and iterate on the quality of data.
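
One way to keep track of a data source is a small structured record stored next to the training code. The sketch below is only an illustration: the `DataSourceRecord` class, its field names, and the example values are all hypothetical and do not come from the course.

```python
from dataclasses import dataclass, asdict
from datetime import date
import json


@dataclass
class DataSourceRecord:
    """One documented data source (all field names are illustrative)."""
    name: str
    origin: str          # where the data comes from
    owner: str           # who to contact about it
    retrieved_on: str    # ISO date of the snapshot used for training
    known_issues: list   # quality notes to revisit when iterating


# Hypothetical example values, not real project data.
survey = DataSourceRecord(
    name="customer_survey",
    origin="internal survey export (hypothetical)",
    owner="analytics team",
    retrieved_on=date(2023, 1, 15).isoformat(),
    known_issues=["free-text names are not deduplicated"],
)

# Serialize so the record can be versioned alongside the pipeline.
print(json.dumps(asdict(survey), indent=2))
```

Keeping `known_issues` in the record gives a place to log quality findings as the data is evaluated and iterated on.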


Data schemas

A data schema is a structure that describes the organization of data.

For a relational database schema:

Database key          Data type   Data order
Person.name           string      nominal
Person.survey_score   integer     ordinal
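
A documented schema can also be written as code and used to check incoming rows. The following is a minimal sketch using the two keys from the table above. The `SCHEMA` dictionary, the `validate_row` helper, and the sample rows are assumptions made for illustration.

```python
# Schema from the table above: key -> (Python type, data order).
SCHEMA = {
    "Person.name": (str, "nominal"),
    "Person.survey_score": (int, "ordinal"),
}


def validate_row(row: dict) -> list:
    """Return a list of human-readable schema violations for one row."""
    errors = []
    for key, (expected_type, _order) in SCHEMA.items():
        if key not in row:
            errors.append(f"missing key: {key}")
        elif not isinstance(row[key], expected_type):
            errors.append(
                f"{key}: expected {expected_type.__name__}, "
                f"got {type(row[key]).__name__}"
            )
    return errors


# Hypothetical rows: the second stores the score as text by mistake.
good = {"Person.name": "Ada", "Person.survey_score": 4}
bad = {"Person.name": "Ada", "Person.survey_score": "4"}
print(validate_row(good))  # no violations
print(validate_row(bad))   # one violation for Person.survey_score
```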


Labeling methods (for classification)

Documenting how we labeled our response variable enhances:

  1. Reproducibility of the training pipeline.

  2. Model reliability through label quality.

  3. Model performance through label improvement.

Labeling methods can evolve over time.
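
Because labeling rules change, it can help to version the rule and store that version with every label. The sketch below assumes a hypothetical threshold rule on a survey score. The thresholds, label names, and version history are invented for illustration.

```python
LABELING_VERSION = "v2"  # bump whenever the rule below changes


def label_response(survey_score: int) -> str:
    """Label a survey response for classification.

    v1 (hypothetical): score >= 3 was "satisfied".
    v2 (hypothetical): threshold raised to 4 after reviewing borderline cases.
    """
    return "satisfied" if survey_score >= 4 else "unsatisfied"


def labeled_example(survey_score: int) -> dict:
    """Attach the labeling version so each label can be traced to its rule."""
    return {
        "survey_score": survey_score,
        "label": label_response(survey_score),
        "labeling_version": LABELING_VERSION,
    }


print(labeled_example(3))
```

Recording the version alongside the label makes it possible to tell which training examples were produced under an older rule.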


Model pseudocode

A visual representation of the different steps involved in building your machine learning model.

This often includes:

  • Feature engineering steps.
  • Components of an ensemble pipeline.
  • Example inputs and outputs of the model.
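
The three items above can be sketched as a small runnable outline. Everything in it is a stand-in: the feature names, the assumed 1-5 score scale, and the two toy "models" are hypothetical rules, not trained models or the course's own pipeline.

```python
def engineer_features(raw: dict) -> dict:
    """Feature engineering step: derive model inputs from a raw record."""
    return {
        "name_length": len(raw["name"]),
        "score_scaled": raw["survey_score"] / 5,  # assumes a 1-5 scale
    }


def model_a(features: dict) -> float:
    """First ensemble member (toy rule standing in for a trained model)."""
    return features["score_scaled"]


def model_b(features: dict) -> float:
    """Second ensemble member (toy rule standing in for a trained model)."""
    return 1.0 if features["name_length"] > 3 else 0.0


def predict(raw: dict) -> float:
    """Ensemble step: average the members' scores."""
    features = engineer_features(raw)
    return (model_a(features) + model_b(features)) / 2


# Example input and output, as the documentation would show them.
example_input = {"name": "Grace", "survey_score": 4}
print(predict(example_input))
```

Keeping each step as its own named function mirrors how pseudocode separates feature engineering from the ensemble components.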

Model experimentation + selection

Documenting how we experimented with and selected the best model means recording:

  • The model development process.
  • The models considered.
  • The metrics used.
  • The hyperparameter combinations tried for each model.
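
A simple experiment log can capture the models considered, the metrics used, and every hyperparameter combination tried. This is a sketch under stated assumptions: the search space, metric names, and the `placeholder_evaluate` function are hypothetical, and the placeholder returns no real results.

```python
import itertools
import json

# Hypothetical search space: model name -> hyperparameter grid.
SEARCH_SPACE = {
    "logistic_regression": {"C": [0.1, 1.0]},
    "random_forest": {"n_estimators": [50, 100], "max_depth": [3, 5]},
}
METRICS = ["accuracy", "f1"]  # metrics chosen for reporting (assumed)


def expand_grid(grid: dict) -> list:
    """Return every hyperparameter combination in a grid as a dict."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in itertools.product(*grid.values())]


def build_experiment_log(evaluate) -> list:
    """Call `evaluate(model, params)` for every combination and record it."""
    log = []
    for model, grid in SEARCH_SPACE.items():
        for params in expand_grid(grid):
            log.append({"model": model, "params": params,
                        "metrics": evaluate(model, params)})
    return log


def placeholder_evaluate(model: str, params: dict) -> dict:
    """Stand-in for real training and scoring; returns no real results."""
    return {metric: None for metric in METRICS}


log = build_experiment_log(placeholder_evaluate)
print(json.dumps(log[0], indent=2))
print(len(log), "runs recorded")
```

In practice `placeholder_evaluate` would be replaced by code that trains the model and computes the listed metrics on a validation set.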


Training environments

To document our training environment, we should include:

  • Packages used, with versions (e.g., scikit-learn==1.1.3).
  • Any random seeds used for non-deterministic training steps (e.g., dimensionality reduction algorithms).

Why?

  • Ability to reproduce the results of our machine learning models.
  • Ensuring consistency between training and production deployments.
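
The package versions and seeds can be captured automatically at training time. The sketch below uses only the Python standard library. The package list and the seed value of 42 are arbitrary examples, and a package that is not installed is reported as `None`.

```python
import json
import platform
import random
from importlib import metadata

SEED = 42  # arbitrary example; record whichever seed was actually used


def package_versions(names: list) -> dict:
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def describe_environment(packages: list, seed: int) -> dict:
    """Collect what is needed to rerun training in the same setup."""
    return {
        "python": platform.python_version(),
        "packages": package_versions(packages),
        "random_seed": seed,
    }


# Seeds only the stdlib RNG; other libraries need their own seeding.
random.seed(SEED)
env = describe_environment(["scikit-learn", "numpy"], SEED)
print(json.dumps(env, indent=2))
```

Saving this output with each trained model gives a record to compare against the production environment before deployment.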

Let's practice!
