Executing DVC pipelines

Introduction to Data Versioning with DVC

Ravi Bhadauria

Machine Learning Engineer

Recap

  • Preprocessing stage
stages:
  preprocess:
    cmd: python3 preprocess.py
    params:
      - preprocess
    deps:
      - preprocess.py
      - raw_data.csv
    outs:
      - processed_data.csv
  • Training and evaluation
stages:
  train_and_evaluate:
    cmd: python3 train_and_evaluate.py
    params:
      - train_and_evaluate
    deps:
      - processed_data.csv
      - train_and_evaluate.py
    outs:
      - plots.png
      - metrics.json
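The two stages are linked only through files: processed_data.csv is an output of preprocess and a dependency of train_and_evaluate. The sketch below shows how an execution order could be derived from that information alone. It is a simplified illustration, not DVC's internal implementation; the `stages` dictionary and the `stage_order` helper are hypothetical names that mirror the YAML above.

```python
from graphlib import TopologicalSorter

# Stage definitions mirrored from the two dvc.yaml snippets in the recap.
stages = {
    "preprocess": {
        "deps": ["preprocess.py", "raw_data.csv"],
        "outs": ["processed_data.csv"],
    },
    "train_and_evaluate": {
        "deps": ["processed_data.csv", "train_and_evaluate.py"],
        "outs": ["plots.png", "metrics.json"],
    },
}


def stage_order(stages):
    """Return stage names so that producers come before their consumers."""
    # Map each output file to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # A stage depends on every stage that produces one of its dependencies.
    graph = {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(stage_order(stages))  # ['preprocess', 'train_and_evaluate']
```

Files such as raw_data.csv that no stage produces are treated as plain inputs and add no edge to the graph.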

Reproducing a pipeline

  • Reproduce the pipeline using dvc repro
$ dvc repro
Running stage 'preprocess':
> python3 preprocess.py

Running stage 'train_and_evaluate':
> python3 train_and_evaluate.py
Updating lock file 'dvc.lock'
  • dvc repro generates a state file, dvc.lock
    • Like a .dvc file, it records MD5 hashes of the stage's dependencies and outputs
$ git add dvc.lock && git commit -m "first pipeline run"
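To make the idea of a recorded content hash concrete, here is a minimal sketch that computes the MD5 digest of a file with Python's standard library. The helper name `file_md5` is hypothetical, and the digest DVC writes to dvc.lock may be computed differently depending on the DVC version, so treat this as an illustration of content hashing rather than a way to reproduce dvc.lock entries exactly.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large data files need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `file_md5("raw_data.csv")` before and after editing the file gives two different digests, which is the kind of change a lock file can detect.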

Using cached results

  • Stages whose dependencies have not changed are skipped, which speeds up iteration
$ dvc repro
Stage 'preprocess' didn't change, skipping
Running stage 'train_and_evaluate' with command: ...

Stage caching in DVC

Flowchart of how stage caching works in DVC
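The decision behind stage caching can be sketched as a comparison between the hashes recorded on the last run and the hashes of the current workspace. This is a simplified assumption about the logic, not DVC's actual implementation; the function names and the shortened hash values are hypothetical.

```python
def changed_entries(recorded, current):
    """Return names whose hash was added, removed, or modified since the last run."""
    names = set(recorded) | set(current)
    return sorted(n for n in names if recorded.get(n) != current.get(n))


def should_run(recorded, current):
    """A stage is rerun when at least one tracked hash has changed."""
    return bool(changed_entries(recorded, current))


# Hypothetical hash values, shortened for readability.
recorded = {"preprocess.py": "a1", "raw_data.csv": "b2"}
unchanged = {"preprocess.py": "a1", "raw_data.csv": "b2"}
edited = {"preprocess.py": "a1", "raw_data.csv": "c3"}

print(should_run(recorded, unchanged))    # False: the stage can be skipped
print(should_run(recorded, edited))       # True: raw_data.csv changed
print(changed_entries(recorded, edited))  # ['raw_data.csv']
```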


Dry running a pipeline

  • Use the --dry flag to print the commands without running them
$ dvc repro --dry
Running stage 'preprocess':
> python3 preprocess.py

Running stage 'train_and_evaluate':
> python3 train_and_evaluate.py

Additional arguments

  • Running specific files: dvc repro linear/dvc.yaml
    • Multiple dvc.yaml files in one folder are not allowed
  • Running specific stages: dvc repro <target_stage>
    • This also runs the stage's upstream dependencies
  • Force-running a pipeline or stage: dvc repro -f
  • Not storing execution outputs in the cache: dvc repro --no-commit
    • Use dvc commit later to store them

Parallel stage execution

Schematic of parallel stage execution DAG in DVC

  • Run independent steps concurrently
# Run A2 and its upstream dependencies
$ dvc repro A2
# Run B2 and its upstream dependencies
$ dvc repro B2
  • Use caching to speed up execution
$ dvc repro train
Stage 'A2' didn't change, skipping
Stage 'B2' didn't change, skipping
Running stage 'train' with command: ...
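Why A2 and B2 can be reproduced independently becomes clearer when the upstream stages of each target are listed. The sketch below walks a small dependency graph and returns a target together with everything it depends on. The graph shape is an assumption, since the schematic is not reproduced here: A1 and B1 are hypothetical stage names, and only A2, B2, and train appear in the commands above.

```python
# Each stage maps to the stages it directly depends on.
# A1 and B1 are hypothetical names; the commands above only show A2, B2 and train.
dag = {
    "A1": [],
    "A2": ["A1"],
    "B1": [],
    "B2": ["B1"],
    "train": ["A2", "B2"],
}


def upstream(dag, target):
    """Return target and all stages it transitively depends on, dependencies first."""
    order = []
    seen = set()

    def visit(stage):
        if stage in seen:
            return
        seen.add(stage)
        for parent in dag[stage]:
            visit(parent)
        # Append only after all parents, so dependencies come first.
        order.append(stage)

    visit(target)
    return order


print(upstream(dag, "A2"))     # ['A1', 'A2']
print(upstream(dag, "B2"))     # ['B1', 'B2']
print(upstream(dag, "train"))  # ['A1', 'A2', 'B1', 'B2', 'train']
```

In this assumed graph the upstream sets of A2 and B2 share no stage, which is what makes the two branches safe to treat as independent.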

Let's practice!
