Writing and visualizing DVC pipelines

Introduction to Data Versioning with DVC

Ravi Bhadauria

Machine Learning Engineer

DVC Pipelines

  • Sequence of stages defining the machine learning workflow and its dependencies

    • Versioned and tracked with Git
  • Defined in dvc.yaml file

    • Input data and scripts (deps)
    • Parameters (params)
    • Stage execution commands (cmd)
    • Output artifacts (outs)
      • Special outputs, e.g. metrics and plots

Adding preprocessing stage

  • Create stages using dvc stage add
dvc stage add \
-n preprocess \
-p params.yaml:preprocess \
-d raw_data.csv \
-d preprocess.py \
-o processed_data.csv \
python3 preprocess.py
stages:
  preprocess:
    cmd: python3 preprocess.py
    params:
      # Keys from params.yaml
      - params.yaml:
        - preprocess
    deps:
    - preprocess.py
    - raw_data.csv
    outs:
    - processed_data.csv

Adding training and evaluation stage

  • Add a training and evaluation stage that uses the output of the previous stage
dvc stage add \
-n train_and_evaluate \
-p train_and_evaluate \
-d train_and_evaluate.py \
-d processed_data.csv \
-o plots.png \
-o metrics.json \
python3 train_and_evaluate.py
  • Stages linked through shared files form a Directed Acyclic Graph (DAG)
stages:
  train_and_evaluate:
    cmd: python3 train_and_evaluate.py
    params:
      # Skip specifying parameter file
      # Defaulted to params.yaml
      - train_and_evaluate
    deps:
    - processed_data.csv
    - train_and_evaluate.py
    outs:
    - plots.png
    - metrics.json
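
The link between the two stages is never declared explicitly: train_and_evaluate lists processed_data.csv under deps, and preprocess lists the same file under outs. A minimal Python sketch of that idea follows, assuming the stages are held in a plain dictionary. The stages dictionary and the stage_order helper are illustrative names only, not DVC's internals or API.

```python
from graphlib import TopologicalSorter

# Stage definitions mirroring the dvc.yaml entries above (illustrative only).
stages = {
    "preprocess": {
        "deps": ["preprocess.py", "raw_data.csv"],
        "outs": ["processed_data.csv"],
    },
    "train_and_evaluate": {
        "deps": ["processed_data.csv", "train_and_evaluate.py"],
        "outs": ["plots.png", "metrics.json"],
    },
}


def stage_order(stages):
    """Return stage names in an order that respects the deps/outs links."""
    # Map every output file to the stage that produces it.
    producer = {
        out: name for name, spec in stages.items() for out in spec["outs"]
    }
    # A stage depends on whichever stages produce its dependencies.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    # static_order() raises CycleError if the graph is not acyclic.
    return list(TopologicalSorter(graph).static_order())


print(stage_order(stages))
```

Files that no stage produces, such as raw_data.csv, add no edge in this sketch; they are simply external inputs.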

Updating stages

  • Running dvc stage add again for an existing stage raises an error
ERROR: Stage 'train_and_evaluate' 
already exists in 'dvc.yaml'.
Use '--force' to overwrite.
  • Use dvc stage add --force
dvc stage add --force \
-n train_and_evaluate \
-p train_and_evaluate \
-d train_and_evaluate.py \
-d processed_data.csv \
-o plots.png \
-o metrics.json \
python3 train_and_evaluate.py

Visualizing DVC pipelines

# Print DAG on terminal
dvc dag
# Display DAG up to a certain step
dvc dag <target>
            +------------+     
            | preprocess |     
            +------------+     
                   *           
                   *           
                   *           
        +--------------------+ 
        | train_and_evaluate | 
        +--------------------+
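
Restricting the DAG to a target amounts to keeping that stage and everything upstream of it. The sketch below illustrates the traversal in Python under that assumption. The parents mapping and the upstream helper are hypothetical names used for illustration, not how DVC implements dvc dag.

```python
def upstream(target, parents):
    """Collect the target stage and every stage it depends on, directly or not."""
    seen = set()
    stack = [target]
    while stack:
        stage = stack.pop()
        if stage not in seen:
            seen.add(stage)
            # Visit the stages this one depends on.
            stack.extend(parents.get(stage, ()))
    return seen


# Each stage maps to the stages it depends on (taken from the DAG above).
parents = {
    "preprocess": [],
    "train_and_evaluate": ["preprocess"],
}

print(sorted(upstream("preprocess", parents)))
print(sorted(upstream("train_and_evaluate", parents)))
```

With preprocess as the target only that stage remains, while train_and_evaluate as the target keeps both stages.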

Visualizing DVC pipelines

# Display step outputs as nodes
dvc dag --outs
                    +-------------------------------+              
                    | processed_dataset/weather.csv |              
                    +-------------------------------+              
                          ***               ***                    
                       ***                     ***                 
                     **                           **               
          +--------------+                    +-----------+ 
          | metrics.json |                    | plots.png | 
          +--------------+                    +-----------+

Visualizing DVC pipelines

dvc dag --dot
strict digraph  {
"preprocess";
"train_and_evaluate";
"preprocess" -> "train_and_evaluate";
}

A two-stage DVC DAG plotted as a graph from the DOT file

1 https://dreampuf.github.io/GraphvizOnline/
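
The DOT text shown for this pipeline is a list of quoted node names followed by quoted edges. As a rough illustration of that format, the Python sketch below builds the same text from a node list and an edge list. The to_dot helper is a hypothetical name, and the layout is copied from the output shown on this slide rather than from DVC's source.

```python
def to_dot(nodes, edges):
    """Render nodes and (source, destination) edges as DOT text."""
    lines = ["strict digraph  {"]
    # One quoted statement per node.
    lines += [f'"{node}";' for node in nodes]
    # One quoted arrow per edge, pointing from upstream to downstream.
    lines += [f'"{src}" -> "{dst}";' for src, dst in edges]
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot(
    ["preprocess", "train_and_evaluate"],
    [("preprocess", "train_and_evaluate")],
)
print(dot_text)
```

The resulting text can be pasted into a Graphviz viewer such as the one linked in the footnote to render the graph.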

Let's practice!
