Introduction to Data Versioning with DVC
Ravi Bhadauria
Machine Learning Engineer
stages:
  preprocess:
    cmd: python3 preprocess.py
    params:
      - preprocess
    deps:
      - preprocess.py
      - raw_data.csv
    outs:
      - processed_data.csv
  train_and_evaluate:
    cmd: python3 train_and_evaluate.py
    params:
      - train_and_evaluate
    deps:
      - processed_data.csv
      - train_and_evaluate.py
    outs:
      - plots.png
      - metrics.json
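The two stages are linked only through their files: processed_data.csv is an output of preprocess and a dependency of train_and_evaluate. As a rough illustration of how a run order can be derived from that information, the sketch below rebuilds the stage graph in plain Python with the standard-library graphlib module (Python 3.9+). It is not DVC's implementation; the STAGES dictionary and the function name execution_order are made up for this example.

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory copy of the deps/outs declared in the dvc.yaml above.
STAGES = {
    "preprocess": {
        "deps": ["preprocess.py", "raw_data.csv"],
        "outs": ["processed_data.csv"],
    },
    "train_and_evaluate": {
        "deps": ["processed_data.csv", "train_and_evaluate.py"],
        "outs": ["plots.png", "metrics.json"],
    },
}


def execution_order(stages):
    """Return stage names ordered so that producers run before consumers."""
    # Map every output file to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # A stage depends on whichever stages produce its deps (if any do).
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(STAGES))
```

Files such as raw_data.csv that no stage produces are treated as plain inputs, so they add no edge to the graph.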
Reproduce the pipeline with dvc repro:
$ dvc repro
Running stage 'preprocess':
> python preprocess.py
Running stage 'train_and_evaluate':
> python train_and_evaluate.py
Updating lock file 'dvc.lock'
A dvc.lock file is generated. Like a .dvc file, it captures MD5 hashes.
Commit it to Git right after the run:
$ git add dvc.lock && git commit -m "first pipeline run"
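To make the idea of capturing MD5 hashes concrete, here is a small sketch that computes a file's MD5 digest with Python's hashlib. It only illustrates content hashing in general: the helper name file_md5 is hypothetical, and the exact layout of dvc.lock is not reproduced here.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Demonstrate on a throwaway file: editing the content changes the hash.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "raw_data.csv")
        with open(path, "w") as handle:
            handle.write("a,b\n1,2\n")
        before = file_md5(path)
        with open(path, "a") as handle:
            handle.write("3,4\n")
        after = file_md5(path)
        print(before != after)
```

Because the digest depends only on file content, any edit to a tracked file yields a different hash, which is what makes a recorded hash usable as a change detector.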
$ dvc repro
Stage 'preprocess' didn't change, skipping
Running stage 'train_and_evaluate' with command: ...
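The second run skips preprocess because nothing it depends on has changed, while train_and_evaluate runs again. The decision can be pictured as a comparison between recorded and current hashes. The sketch below assumes two plain dictionaries of hashes and a hypothetical function stages_to_run; it is a simplification rather than DVC's actual logic, which also tracks parameters and outputs.

```python
def stages_to_run(stage_deps, recorded, current):
    """Return the stages whose dependency hashes differ from the recorded ones.

    stage_deps: stage name -> list of dependency paths
    recorded:   path -> hash stored at the last run (e.g. from a lock file)
    current:    path -> hash of the file as it is now
    """
    changed = []
    for stage, deps in stage_deps.items():
        if any(recorded.get(dep) != current.get(dep) for dep in deps):
            changed.append(stage)
    return changed


if __name__ == "__main__":
    # Made-up hash values, for illustration only.
    stage_deps = {
        "preprocess": ["preprocess.py", "raw_data.csv"],
        "train_and_evaluate": ["processed_data.csv", "train_and_evaluate.py"],
    }
    recorded = {
        "preprocess.py": "aaa",
        "raw_data.csv": "bbb",
        "processed_data.csv": "ccc",
        "train_and_evaluate.py": "ddd",
    }
    # Only the training script was edited since the last run.
    current = dict(recorded, **{"train_and_evaluate.py": "eee"})
    print(stages_to_run(stage_deps, recorded, current))
```

In this sketch a change in an upstream stage reaches downstream stages only once its regenerated outputs carry new hashes, which is why the comparison is made per dependency file.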
Use the --dry flag to only print commands without running the pipeline:
$ dvc repro --dry
Running stage 'preprocess':
> python3 preprocess_dataset.py
Running stage 'train_and_evaluate':
> python3 train_and_evaluate.py
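Run the pipeline in a specific dvc.yaml file: dvc repro linear/dvc.yaml
Multiple dvc.yaml files in one folder are not allowed.
Run a specific stage: dvc repro <target_stage>
Force stages to run even if nothing changed: dvc repro -f
Run without saving outputs to the cache: dvc repro --no-commit (use dvc commit later)
# Run A2 and its upstream dependencies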
$ dvc repro A2
# Run B2 and its upstream dependencies
$ dvc repro B2
$ dvc repro train
Stage 'A2' didn't change, skipping
Stage 'B2' didn't change, skipping
Running stage 'train' with command: ...
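Targeting a stage runs that stage plus everything upstream of it. As a sketch, the upstream set can be collected by walking the dependency graph backwards. The graph below is an assumption: the slides only name A2, B2 and train, so the parent stages A1 and B1 and all edges are hypothetical, as is the function name upstream_closure.

```python
def upstream_closure(graph, target):
    """Return the target stage plus every stage it transitively depends on.

    graph maps each stage name to the list of stages it directly depends on.
    """
    seen = set()
    stack = [target]
    while stack:
        stage = stack.pop()
        if stage in seen:
            continue
        seen.add(stage)
        stack.extend(graph.get(stage, []))
    return seen


if __name__ == "__main__":
    # Hypothetical pipeline: two branches (A1 -> A2, B1 -> B2) feeding train.
    graph = {
        "A1": [],
        "B1": [],
        "A2": ["A1"],
        "B2": ["B1"],
        "train": ["A2", "B2"],
    }
    print(sorted(upstream_closure(graph, "A2")))
    print(sorted(upstream_closure(graph, "train")))
```

Under these assumptions, targeting A2 never touches the B branch, while targeting train pulls in both branches; stages in that set that are already up to date are then skipped, as in the output above.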