Introduction to Data Versioning with DVC
Ravi Bhadauria
Machine Learning Engineer
The DVC cache is stored by default under the .dvc directory in the workspace (.dvc/cache)
To use a different cache location:
$ dvc cache dir ~/mycache
$ dvc add data.csv
100% Adding...|====================|1/1 [00:00, 53.55file/s]
To track the changes with git, run:
git add data.csv.dvc
To enable auto staging, run:
dvc config core.autostage true
Each DVC-tracked file has a corresponding .dvc file:
data.csv -> data.csv.dvc
To version the data file, stage and commit its .dvc file with Git: git add data.csv.dvc, then git commit -m "Add data.csv.dvc"
Content of .dvc files:
outs:
- md5: f38a850818377e97155d22755caa39d0
  size: 16
  hash: md5
  path: data.csv
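The four fields recorded for an output can be reproduced by hand. The following is a minimal sketch in Python, assuming a single-file output and that the md5 field is the MD5 of the file's raw bytes. `describe_output` is a hypothetical helper written for illustration, not part of DVC's API, and the sample file contents are made up.

```python
import hashlib
import os
import tempfile


def describe_output(path):
    """Return md5, size, hash, and path fields for one file (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "md5": digest.hexdigest(),
        "size": os.path.getsize(path),
        "hash": "md5",
        "path": os.path.basename(path),
    }


# Demo on a throwaway file; the contents are invented for illustration only.
with tempfile.TemporaryDirectory() as workdir:
    sample = os.path.join(workdir, "data.csv")
    with open(sample, "wb") as handle:
        handle.write(b"hello world")
    fields = describe_output(sample)
    print(fields)
```

Because the hash depends only on the file's contents, any change to data.csv produces a different md5 value, which is what lets Git track data versions through the small .dvc file.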
$ find .dvc/cache -type f
.dvc/cache/f3/8a850818377e97155d22755caa39d0
$ md5 data.csv
MD5 (data.csv) = f38a850818377e97155d22755caa39d0
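The find and md5 outputs above suggest how a cached object's location follows from its hash: the first two hex characters name a subdirectory and the remaining characters name the file. A minimal sketch of that mapping, assuming the layout shown in these slides (it can differ between DVC versions); `cache_path` is a hypothetical helper, not a DVC function.

```python
from pathlib import PurePosixPath


def cache_path(md5_hex, cache_dir=".dvc/cache"):
    """Map an MD5 hex digest to the cache layout shown above (hypothetical helper)."""
    if len(md5_hex) != 32:
        raise ValueError("expected a 32-character MD5 hex digest")
    # First two hex characters name the subdirectory, the remaining 30 the file.
    return str(PurePosixPath(cache_dir) / md5_hex[:2] / md5_hex[2:])


# Prints .dvc/cache/f3/8a850818377e97155d22755caa39d0
print(cache_path("f38a850818377e97155d22755caa39d0"))
```

Storing objects by content hash means two identical files share a single cache entry.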
Use dvc add -v for verbose output
To stop tracking a file, use dvc remove on its .dvc file:
$ dvc remove data.csv.dvc
Use dvc gc with the -w flag to remove cached objects that the current workspace no longer uses:
$ dvc gc -w
WARNING: This will remove all cache except items used in the workspace of the current repo.
Are you sure you want to proceed? [y/n]: y
Removed 1 objects from repo cache.
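The idea behind this cleanup can be illustrated with a simplified model: keep every cached object whose hash is still referenced by the workspace and delete the rest. The sketch below is an assumption-laden illustration, not DVC's implementation. `collect_garbage` is a hypothetical helper, the two-character subdirectory layout is taken from the find output earlier, and the cached contents are placeholders.

```python
import os
import tempfile


def collect_garbage(cache_dir, used_hashes):
    """Delete cache objects whose hash is not in used_hashes; return the count removed."""
    removed = 0
    for prefix in sorted(os.listdir(cache_dir)):
        prefix_dir = os.path.join(cache_dir, prefix)
        if not os.path.isdir(prefix_dir):
            continue
        for name in sorted(os.listdir(prefix_dir)):
            # Subdirectory name plus file name reconstructs the full hash.
            if prefix + name not in used_hashes:
                os.remove(os.path.join(prefix_dir, name))
                removed += 1
    return removed


# Demo: two cached objects, only one still referenced by the workspace.
with tempfile.TemporaryDirectory() as cache_dir:
    kept = "5eb63bbbe01eeed093cb22bb8f5acdc3"
    stale = "f38a850818377e97155d22755caa39d0"
    for digest in (kept, stale):
        os.makedirs(os.path.join(cache_dir, digest[:2]), exist_ok=True)
        with open(os.path.join(cache_dir, digest[:2], digest[2:]), "wb") as handle:
            handle.write(b"placeholder")
    removed_count = collect_garbage(cache_dir, used_hashes={kept})
    kept_exists = os.path.exists(os.path.join(cache_dir, kept[:2], kept[2:]))
    stale_exists = os.path.exists(os.path.join(cache_dir, stale[:2], stale[2:]))
    print(f"Removed {removed_count} objects")
```

In the session above, data.csv.dvc was removed first, so the object for data.csv was no longer referenced and the cleanup deleted it.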
dvc cache dir ~/mycache
dvc add data.csv
.dvc file with metadata
dvc remove data.csv.dvc
dvc gc -w