DVC Pipelines

CI/CD for Machine Learning

Ravi Bhadauria

Machine Learning Engineer

The need for a data pipeline

  • Versioning data alone is not enough: we also need to know how each version was produced
  • Tracking data lineage is important
    • Filtering, cleaning, and transformation
    • Model training
  • Rerun only the steps affected by a change
  • Steps form a Directed Acyclic Graph (DAG)

Schematic diagram of a DAG outlining relationship between model preprocessing, training, and artifacts
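
To make the DAG idea concrete, here is a minimal Python sketch of ordering steps so that each one runs only after the steps it depends on. It is an illustration, not DVC's implementation: the step names are hypothetical, and it relies on the standard library's graphlib module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical steps: each key maps to the set of steps it depends on.
steps = {
    "clean": set(),
    "transform": {"clean"},
    "train": {"transform"},
}

# static_order() yields each step only after all of its dependencies.
order = list(TopologicalSorter(steps).static_order())
print(order)
```

Because the graph is acyclic, a valid execution order always exists; a cycle would make TopologicalSorter raise a CycleError.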


DVC pipelines

  • Sequence of stages defining ML workflow and dependencies

  • Defined in dvc.yaml file

    • Input data and scripts (deps)
    • Stage execution commands (cmd)
    • Output artifacts (outs)
      • Special outputs, e.g. metrics and plots
  • Similar to a GitHub Actions workflow

    • Focused on ML tasks instead of CI/CD
    • Can be run as a single step in a GitHub Actions (GHA) workflow

Defining pipeline stages

  • Create stages using dvc stage add
dvc stage add \
-n preprocess \
-d raw_data.csv -d preprocess.py \
-o processed_data.csv \
python preprocess.py
  • dvc.yaml contents
    stages:
      preprocess:
        cmd: python preprocess.py
        deps:
        - preprocess.py
        - raw_data.csv
        outs:
        - processed_data.csv
    

Dependency graphs

  • Add a training stage that uses the output of the previous stage
    dvc stage add \
    -n train \
    -d train.py -d processed_data.csv \
    -o plots.png -o metrics.txt \
    python train.py
    
stages:
  preprocess:
    cmd: python preprocess.py
    deps:
    - preprocess.py
    - raw_data.csv
    outs:
    - processed_data.csv

  train:
    cmd: python train.py
    deps:
    - processed_data.csv
    - train.py
    outs:
    - metrics.txt
    - plots.png
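
The link between the two stages is implicit: train depends on preprocess because it lists processed_data.csv, an output of preprocess, as a dependency. The Python sketch below shows one way such edges could be derived from stage definitions. The dictionary layout and the upstream_stages helper are assumptions made for illustration, not DVC's internal code.

```python
# The two stages defined above, written as plain dicts (illustrative layout).
stages = {
    "preprocess": {
        "deps": ["preprocess.py", "raw_data.csv"],
        "outs": ["processed_data.csv"],
    },
    "train": {
        "deps": ["processed_data.csv", "train.py"],
        "outs": ["plots.png", "metrics.txt"],
    },
}


def upstream_stages(stages):
    """Map each stage to the stages whose outputs it consumes."""
    # Which stage produces each output file?
    producer = {
        out: name for name, spec in stages.items() for out in spec["outs"]
    }
    # A stage depends on the producer of any file listed in its deps.
    return {
        name: sorted({producer[dep] for dep in spec["deps"] if dep in producer})
        for name, spec in stages.items()
    }


graph = upstream_stages(stages)
print(graph)  # {'preprocess': [], 'train': ['preprocess']}
```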

Reproducing a pipeline

  • Reproduce the pipeline using dvc repro
-> dvc repro
Running stage 'preprocess':                                      
> python preprocess.py

Running stage 'train':
> python train.py
Updating lock file 'dvc.lock'
  • A state file dvc.lock is generated
    • Similar to a .dvc file, it captures MD5 hashes of dependencies and outputs
    • Commit to Git immediately to record state
git add dvc.lock && git commit -m "first pipeline repro"
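
To illustrate why content hashes are a good record of pipeline state, the following Python sketch computes the MD5 digest of a file and shows that it changes when the contents change. This is only an illustration of the hashing idea: the md5_of_file helper and the sample file name are hypothetical, and the exact way DVC computes and stores hashes may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Use a throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "example_data.csv"

    sample.write_text("id,value\n1,10\n")
    before = md5_of_file(sample)

    sample.write_text("id,value\n1,11\n")  # change a single value
    after = md5_of_file(sample)

print(before != after)  # True: any change in content changes the hash
```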

Using cached results

  • Stages whose dependencies did not change are skipped; their cached results are reused to speed up iteration
-> dvc repro
Stage 'preprocess' didn't change, skipping
Running stage 'train' with command: ...
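
The skip decision above can be sketched in Python as a comparison between recorded and current dependency hashes. This is a simplified model under stated assumptions: the stages_to_run helper is hypothetical, the hash strings are placeholders, and it conservatively treats every output of a rerun stage as changed. DVC's own logic is more involved.

```python
def stages_to_run(stages, recorded, current):
    """Return the stages that must rerun, given stages in execution order.

    recorded: {path: hash} saved after the last successful run
    current:  {path: hash} of the files as they are now
    """
    stale_outputs = set()
    rerun = []
    for name, spec in stages.items():
        changed = any(
            dep in stale_outputs or recorded.get(dep) != current.get(dep)
            for dep in spec["deps"]
        )
        if changed:
            rerun.append(name)
            # Simplification: assume a rerun stage changes all its outputs.
            stale_outputs.update(spec["outs"])
    return rerun


stages = {
    "preprocess": {
        "deps": ["preprocess.py", "raw_data.csv"],
        "outs": ["processed_data.csv"],
    },
    "train": {
        "deps": ["processed_data.csv", "train.py"],
        "outs": ["plots.png", "metrics.txt"],
    },
}

# Placeholder hashes; only train.py differs from the recorded state.
recorded = {
    "preprocess.py": "a1",
    "raw_data.csv": "b2",
    "processed_data.csv": "c3",
    "train.py": "d4",
}
current = {**recorded, "train.py": "e5"}

print(stages_to_run(stages, recorded, current))  # ['train']
```

Editing only train.py leaves preprocess untouched, which matches the output shown above: preprocess is skipped and train runs again.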

Visualizing DVC pipeline

-> dvc dag
+------------+ 
| preprocess | 
+------------+ 
       *       
       *       
       *       
  +-------+    
  | train |    
  +-------+
1 https://dvc.org/doc/command-reference/dag

Summary

  • DVC pipelines are useful for
    • Rapid iteration due to caching
    • Reproducibility with dvc.yaml and dvc.lock
    • CI/CD enablement
  • Generate pipelines with dvc stage add
  • Reproduce/execute with dvc repro
  • Visualize with dvc dag

Let's practice!
