Code organization and refactoring

Introduction to Data Versioning with DVC

Ravi Bhadauria

Machine Learning Engineer

Prototyping vs production code

  • Prototyping code allows rapid iteration
  • But not suitable for production
    • Untested and prone to errors
    • Not modular, with many repeated code blocks
    • Likely not reproducible

Features of good production code

  • Reproducible: recreates the same outputs in different environments and at different times

  • Modular: written as distinct, independent, and testable modules

  • Consistent: single source of truth for all parameters

    • A configuration/parameter file

A schematic diagram showing three properties of production code

Configuration files and YAML

  • Parameter files must be in a format DVC supports

    • YAML, JSON, TOML, Python
    • Default file name is params.yaml
  • We'll work with YAML

    • YAML Ain't Markup Language
    • Provides a standard format for transferring data between languages or applications
    • Simple and clean format
    • Valid file extensions: .yaml or .yml
1 https://dvc.org/doc/command-reference/params#description

YAML Syntax

  • Specify parameters as dictionaries

    • Keys and values separated by :
  • Comments start with #

  • Data types:

    • Integers, floats, strings
  • Data structures:

    • Arrays
    • Nested Dictionaries
  • Indentation is important

# Key-value pairs
a: 1
b: 1.2
c: "String value"
# Arrays
a: [1, 2.2, 3, 4.8]
b:
  - 5
  - "String value"
# Nested dictionaries
a:
  b: "Some value"
  c: "Some other value"
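
To see how these YAML constructs map onto Python objects, here is a minimal sketch. It assumes the third-party PyYAML package is installed (imported as `yaml`), and the key names (`count`, `ratio`, `outer`, and so on) are made up for illustration so that each key is unique within the file.

```python
import yaml  # PyYAML, assumed to be installed (pip install pyyaml)

# Illustrative YAML text; the key names are hypothetical.
text = """
# Key-value pairs
count: 1
ratio: 1.2
label: "String value"
# Arrays
inline_list: [1, 2.2, 3, 4.8]
block_list:
  - 5
  - "String value"
# Nested dictionaries
outer:
  inner_a: "Some value"
  inner_b: "Some other value"
"""

# safe_load parses the text into plain Python objects
# (dict, list, int, float, str) without running arbitrary code.
params = yaml.safe_load(text)

print(type(params["count"]).__name__)   # integer scalar
print(type(params["ratio"]).__name__)   # float scalar
print(params["block_list"])             # list with mixed types
print(params["outer"]["inner_a"])       # nested dictionary lookup
```

Comments are dropped during parsing, and the indentation under `block_list` and `outer` is what tells the parser which items belong to which key.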

Example configuration file

# Data preprocessing parameters
preprocess:
  ...
  target_column: RainTomorrow
  categorical_features:
    - Location
    - WindGustDir
    - ...

# Model training/evaluation parameters
train_and_evaluate:
  rfc_params:
    n_estimators: 2
    ...
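
One way entry-point code might read such a file, so that every step pulls its settings from the same place, is sketched below. This is an illustration rather than the course's own loader: it assumes PyYAML is installed, `load_params` is a hypothetical helper, and the file contains only the keys visible in the example (the elided entries are left out). A temporary directory stands in for the project root.

```python
import tempfile
from pathlib import Path

import yaml  # PyYAML, assumed to be installed

# Only the keys shown in the example file; elided entries are omitted.
PARAMS_TEXT = """\
preprocess:
  target_column: RainTomorrow
  categorical_features:
    - Location
    - WindGustDir

train_and_evaluate:
  rfc_params:
    n_estimators: 2
"""


def load_params(path):
    """Hypothetical helper: read the whole parameter file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return yaml.safe_load(handle)


# Write the sample file to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    params_path = Path(tmp_dir) / "params.yaml"
    params_path.write_text(PARAMS_TEXT, encoding="utf-8")
    params = load_params(params_path)

# Each step reads only its own section of the shared file.
target_column = params["preprocess"]["target_column"]
categorical_features = params["preprocess"]["categorical_features"]
rfc_params = params["train_and_evaluate"]["rfc_params"]

print(target_column, categorical_features, rfc_params)
```

Because no value is hard-coded in the step itself, changing `n_estimators` means editing one line in the parameter file.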

Example modular function

# In model.py
from sklearn.metrics import precision_score  # assumed source: scikit-learn
def evaluate_model(model, X_test, y_test):
    """Evaluate a model on a test set and return metrics."""
    y_pred = model.predict(X_test)
    precision = precision_score(y_test, y_pred)
    ...
    return { "accuracy": accuracy, "precision": precision,
        "recall": recall, "f1_score": f1 }
# In entry-point code (train_and_evaluate.py)
from model import evaluate_model
metrics = evaluate_model(model, X_test, y_test)
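
A practical benefit of a modular function like this is that it can be tested in isolation. The sketch below is a dependency-free stand-in, not the slide's implementation: it computes the four metrics by hand for binary 0/1 labels instead of calling a metrics library, and `ConstantModel` is a hypothetical stub that mimics a trained classifier's `predict` method.

```python
class ConstantModel:
    """Hypothetical stub classifier that always predicts the same label."""

    def __init__(self, label):
        self.label = label

    def predict(self, X):
        return [self.label for _ in X]


def evaluate_model(model, X_test, y_test):
    """Evaluate a binary (0/1) classifier on a test set and return metrics."""
    y_pred = model.predict(X_test)
    pairs = list(zip(y_test, y_pred))

    # Confusion-matrix counts, with 1 as the positive class.
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1_score": f1}


# Four test rows; the stub predicts 1 for all of them.
X_test = [[0], [1], [2], [3]]
y_test = [1, 0, 1, 1]
metrics = evaluate_model(ConstantModel(1), X_test, y_test)
print(metrics)
```

With three true positives and one false positive, accuracy and precision are both 0.75 and recall is 1.0, which is easy to verify by hand and therefore makes a convenient unit test.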

Sample project code layout

Image of code layout in a Machine Learning repository

Let's practice!
