DVC Cache and Staging Files

Introduction to Data Versioning with DVC

Ravi Bhadauria

Machine Learning Engineer

DVC Cache

  • Hidden storage for tracked data files and versions
  • Stages temporary files until committed
    • Prefer adding large datasets and binary files
  • Lives in .dvc/cache inside the workspace by default
    • Configure the location with dvc cache dir
      $ dvc cache dir ~/mycache
      

Image of DVC structure inside of workspace

Adding Files to Cache

  • Add data files to DVC with dvc add
$ dvc add data.csv
100% Adding...|====================|1/1 [00:00, 53.55file/s]

To track the changes with git, run:
    git add data.csv.dvc
To enable auto staging, run:
    dvc config core.autostage true

.dvc files

  • Each DVC tracked file has its corresponding .dvc file

    • data.csv -> data.csv.dvc
  • To version the data file, commit its .dvc file to Git: git add data.csv.dvc, followed by git commit

  • Content of .dvc files

outs:
- md5: f38a850818377e97155d22755caa39d0
  size: 16
  hash: md5
  path: data.csv
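
The md5 value stored in a .dvc file can be read back with standard text tools. The sketch below is illustrative only: dvc_md5 is a hypothetical helper name, not a DVC command, and it assumes the .dvc file has the simple single-output layout shown above.

```shell
# Sketch only: dvc_md5 is a hypothetical helper, not part of DVC.
# Assumes a .dvc file with one output whose hash is stored under "md5:".
dvc_md5() {
    # Strip the leading "- " or spaces plus the "md5:" key, print the value.
    # Lines such as "hash: md5" do not match because they start with another key.
    sed -n 's/^[- ]*md5: *//p' "$1" | head -n 1
}

# Usage (after running dvc add data.csv):
# dvc_md5 data.csv.dvc
```

For the file shown above, this should print the same digest that appears in the md5 field.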

Interaction with DVC Cache

  • The cache file path is derived from the MD5 hash: the first two characters name the directory, the rest name the file
$ find .dvc/cache -type f
.dvc/cache/f3/8a850818377e97155d22755caa39d0
  • Compute the MD5 of the dataset (md5 on macOS, md5sum on Linux)
$ md5 data.csv
MD5 (data.csv) = f38a850818377e97155d22755caa39d0
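
The mapping from digest to cache path can be sketched as a small shell function. This is a minimal illustration, assuming GNU md5sum is available and the cache uses the layout shown in the find output above; cache_path_for is a hypothetical name, and other DVC versions may nest cache files under additional subdirectories.

```shell
# Sketch only: cache_path_for is a hypothetical helper, not a DVC command.
# Assumes GNU md5sum; on macOS, `md5 -q FILE` prints the bare digest instead.
cache_path_for() {
    # md5sum prints "<digest>  <file>"; keep only the digest
    digest=$(md5sum "$1" | cut -d ' ' -f 1)
    # First two hex characters name the directory, the remaining 30 the file
    prefix=$(printf '%s' "$digest" | cut -c 1-2)
    rest=$(printf '%s' "$digest" | cut -c 3-)
    printf '.dvc/cache/%s/%s\n' "$prefix" "$rest"
}

# Usage: print the expected cache path for a tracked file
# cache_path_for data.csv
```

Comparing the printed path with the output of find .dvc/cache -type f is a quick way to confirm which cache entry belongs to which data file.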

Interaction with DVC Cache

  • Use dvc add -v for verbose output

Image of output from running dvc add with -v flag

Removing from and Cleaning Cache

  • Stop tracking added files using dvc remove (deletes the .dvc file; cached data stays until garbage collection)
$ dvc remove data.csv.dvc
  • To clear unused data from the cache, use dvc gc
    • The -w flag keeps only the data referenced in the current workspace
$ dvc gc -w
WARNING: This will remove all cache except items used in the workspace of the current repo.
Are you sure you want to proceed? [y/n]: y
Removed 1 objects from repo cache.

Summary

  • DVC cache stages data files before commit
  • Configure cache location
    • dvc cache dir ~/mycache
  • Add files to cache
    • dvc add data.csv
    • Creates a .dvc file with metadata
  • Remove added files: dvc remove data.csv.dvc
    • Clean unused cache data with dvc gc -w

Let's practice!
