Introduction to Data Versioning with DVC
Ravi Bhadauria
Machine Learning Engineer
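A DVC pipeline is a sequence of stages defining the machine learning workflow and its dependencies
Defined in the dvc.yaml file
Each stage can specify:
  Input data and scripts (deps)
  Stage parameters (params)
  The command to execute (cmd)
  Output artifacts (outs)
  Special outputs such as metrics and plots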
Add the preprocessing stage with dvc stage add:
dvc stage add \
  -n preprocess \
  -p params.yaml:preprocess \
  -d raw_data.csv \
  -d preprocess.py \
  -o processed_data.csv \
  python3 preprocess.py
stages:
  preprocess:
    cmd: python3 preprocess.py
    params:
      # Keys from params.yaml
      - params.yaml:
          - preprocess
    deps:
      - preprocess.py
      - raw_data.csv
    outs:
      - processed_data.csv
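To make the correspondence between the command-line flags and the dvc.yaml fields concrete, here is a minimal Python sketch that rebuilds the dvc stage add invocation from a stage definition. The helper name build_stage_add_command and the dictionary layout are illustrative assumptions, not part of DVC; only the flags -n, -p, -d, and -o are taken from the command above.

```python
import shlex


def build_stage_add_command(name, stage):
    """Return the `dvc stage add` argument list for a stage definition.

    Hypothetical helper for illustration; it is not a DVC API.
    """
    args = ["dvc", "stage", "add", "-n", name]
    for param in stage.get("params", []):
        args += ["-p", param]   # parameter keys tracked by the stage
    for dep in stage.get("deps", []):
        args += ["-d", dep]     # input data and scripts
    for out in stage.get("outs", []):
        args += ["-o", out]     # output artifacts
    # The command that the stage runs always comes last.
    args += shlex.split(stage["cmd"])
    return args


# Mirrors the preprocess stage shown above.
preprocess = {
    "cmd": "python3 preprocess.py",
    "params": ["params.yaml:preprocess"],
    "deps": ["raw_data.csv", "preprocess.py"],
    "outs": ["processed_data.csv"],
}

command = build_stage_add_command("preprocess", preprocess)
print(shlex.join(command))
```

Reading the printed command next to the YAML shows that each flag feeds exactly one field of the stage entry.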
dvc stage add \
  -n train_and_evaluate \
  -p train_and_evaluate \
  -d train_and_evaluate.py \
  -d processed_data.csv \
  -o plots.png \
  -o metrics.json \
  python3 train_and_evaluate.py
stages:
  train_and_evaluate:
    cmd: python3 train_and_evaluate.py
    params:
      # Parameter file not specified,
      # so it defaults to params.yaml
      - train_and_evaluate
    deps:
      - processed_data.csv
      - train_and_evaluate.py
    outs:
      - plots.png
      - metrics.json
Running dvc stage add multiple times for the same stage name raises an error:
ERROR: Stage 'train_and_evaluate' already exists in 'dvc.yaml'. Use '--force' to overwrite.
Use dvc stage add --force to overwrite the existing stage:
dvc stage add --force \
  -n train_and_evaluate \
  -p train_and_evaluate \
  -d train_and_evaluate.py \
  -d processed_data.csv \
  -o plots.png \
  -o metrics.json \
  python3 train_and_evaluate.py
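The overwrite rule can be pictured with a toy in-memory model. This is only a sketch of the behavior described above, assuming a plain dictionary of stages; the function add_stage and its force argument are hypothetical and do not reflect how DVC is implemented internally.

```python
def add_stage(stages, name, definition, force=False):
    """Add a stage to an in-memory mapping (toy model, not DVC code).

    Refuses to replace an existing stage unless force is True.
    """
    if name in stages and not force:
        raise ValueError(
            f"Stage '{name}' already exists. Use force=True to overwrite."
        )
    stages[name] = definition
    return stages


stages = {}
add_stage(stages, "train_and_evaluate", {"cmd": "python3 train_and_evaluate.py"})

# Adding the same stage name again fails without force.
try:
    add_stage(stages, "train_and_evaluate", {"cmd": "python3 train_and_evaluate.py"})
    second_add_failed = False
except ValueError as error:
    second_add_failed = True
    print(error)

# With force=True the existing definition is replaced.
add_stage(
    stages,
    "train_and_evaluate",
    {"cmd": "python3 train_and_evaluate.py", "outs": ["plots.png", "metrics.json"]},
    force=True,
)
```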
# Print DAG on terminal
dvc dag
# Display DAG up to a certain step
dvc dag <target>
    +------------+
    | preprocess |
    +------------+
          *
          *
          *
+--------------------+
| train_and_evaluate |
+--------------------+
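The edge in this graph follows from the stage definitions: train_and_evaluate lists processed_data.csv as a dependency, and preprocess lists it as an output. A minimal Python sketch of that inference is shown below, assuming the stages are available as a plain dictionary; the function stage_graph is a hypothetical helper, and graphlib (standard library, Python 3.9+) is used here only to derive an execution order, not because DVC uses it.

```python
from graphlib import TopologicalSorter

# Stage definitions copied from the dvc.yaml entries above.
stages = {
    "preprocess": {
        "deps": ["preprocess.py", "raw_data.csv"],
        "outs": ["processed_data.csv"],
    },
    "train_and_evaluate": {
        "deps": ["processed_data.csv", "train_and_evaluate.py"],
        "outs": ["plots.png", "metrics.json"],
    },
}


def stage_graph(stages):
    """Map each stage to the set of stages that produce its dependencies."""
    producer = {
        out: name for name, stage in stages.items() for out in stage["outs"]
    }
    return {
        name: {producer[dep] for dep in stage["deps"] if dep in producer}
        for name, stage in stages.items()
    }


graph = stage_graph(stages)
# static_order() yields upstream stages before the stages that depend on them.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

Files such as raw_data.csv and preprocess.py are produced by no stage, so they add no edges.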
# Display step outputs as nodes
dvc dag --outs
    +-------------------------------+
    | processed_dataset/weather.csv |
    +-------------------------------+
           ***             ***
        ***                   ***
      **                         **
+--------------+             +-----------+
| metrics.json |             | plots.png |
+--------------+             +-----------+
dvc dag --dot
strict digraph {
"preprocess";
"train_and_evaluate";
"preprocess" -> "train_and_evaluate";
}
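Because the DOT output is plain text, it is easy to post-process. The sketch below extracts nodes and edges from the output shown above using regular expressions. It assumes the simple quoted form printed here and is not a full DOT parser; the name parse_dot is a hypothetical helper.

```python
import re

# Output of `dvc dag --dot` as shown above.
dot_text = '''strict digraph {
"preprocess";
"train_and_evaluate";
"preprocess" -> "train_and_evaluate";
}'''

EDGE = re.compile(r'"([^"]+)"\s*->\s*"([^"]+)"')
NODE = re.compile(r'^\s*"([^"]+)"\s*;\s*$')


def parse_dot(text):
    """Return (nodes, edges) from simple quoted DOT lines."""
    nodes, edges = [], []
    for line in text.splitlines():
        edge = EDGE.search(line)
        if edge:
            edges.append(edge.groups())  # (source, target) pair
            continue
        node = NODE.match(line)
        if node:
            nodes.append(node.group(1))
    return nodes, edges


nodes, edges = parse_dot(dot_text)
print(nodes)
print(edges)
```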