Interacting with DVC Remotes

Introduction to Data Versioning with DVC

Ravi Bhadauria

Machine Learning Engineer

Uploading and Retrieving Data

  • Moving data between the cache and a DVC remote
$ dvc push <target>

$ dvc pull <target>
  • Targets can be individual files or directories tracked by DVC
$ dvc push data.csv
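
The round trip above can be sketched as two small helper functions. This is a minimal sketch: the function names `share_file` and `get_file` are hypothetical and not part of DVC. It assumes a DVC repository with a default remote already configured and a target such as data.csv already tracked with `dvc add`.

```shell
#!/bin/sh
# Hypothetical helpers; only `dvc push` and `dvc pull` are real DVC commands.

# Upload one tracked target from the local cache to the default remote.
share_file() {
    target="$1"
    dvc push "$target"
}

# Download one tracked target from the default remote into the cache
# and place it in the workspace.
get_file() {
    target="$1"
    dvc pull "$target"
}

# Example (run inside a DVC repository):
#   share_file data.csv   # on the machine that already has the data
#   get_file data.csv     # on a fresh clone of the Git repository
```

Wrapping the commands in functions is only a convenience for scripting; typing the two commands directly has the same effect.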

Uploading and Retrieving Data

  • Push all data tracked in the workspace (no target given)
$ dvc push
  • Download from the remote to the cache without changing workspace contents
$ dvc fetch
  • Override default remote with -r flag
$ dvc push -r aws_remote data.csv
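
The commands on this slide can be combined as in the sketch below. The helper names `fetch_then_checkout` and `push_to_remote` are hypothetical; `aws_remote` is the remote name used above and is assumed to have been configured already.

```shell
#!/bin/sh
# Hypothetical helpers built from real DVC commands.

# Download to the cache first, then update the workspace from the cache.
# Together, `dvc fetch` and `dvc checkout` do what `dvc pull` does in one step.
fetch_then_checkout() {
    dvc fetch || return 1      # remote -> cache, workspace untouched
    dvc checkout               # cache -> workspace
}

# Push one target to a named remote instead of the default remote.
push_to_remote() {
    remote="$1"
    target="$2"
    dvc push -r "$remote" "$target"
}

# Example (run inside a DVC repository):
#   fetch_then_checkout
#   push_to_remote aws_remote data.csv
```

Splitting the download into fetch and checkout is useful when the data should be staged locally before the workspace is changed.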

Similarities with Git

dvc pull

  • Function: Downloads data from remote storage to the cache and workspace

  • Use Case: Large datasets or model artifacts

dvc push

  • Function: Uploads data to remote storage

  • Use Case: Sharing or storing data artifacts

git pull

  • Function: Fetches and merges changes from the remote Git repo

  • Use Case: Keep the local branch in sync with the remote

git push

  • Function: Uploads local changes to remote

  • Use Case: Share changes to Git remote
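
Because the two tools mirror each other, they are often run as a pair. The sketch below is one possible way to do that; the function names `sync_project` and `publish_project` are hypothetical, and both a Git remote and a default DVC remote are assumed to be configured.

```shell
#!/bin/sh
# Hypothetical helpers pairing the Git and DVC commands compared above.

# Bring code, .dvc metadata, and data up to date.
sync_project() {
    git pull || return 1       # updates code and .dvc files
    dvc pull                   # downloads the data those .dvc files point to
}

# Share committed changes and the data they refer to.
publish_project() {
    git push || return 1       # uploads commits (including .dvc files)
    dvc push                   # uploads the matching data to remote storage
}
```

Running Git first when syncing means the .dvc files already describe the latest data version by the time DVC downloads it.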


Versioning data

  • The .dvc file is tracked by Git, not DVC

  • Leverage this to check out a specific version of a data file

  • Check out the .dvc file

$ git checkout <commit_hash|tag|branch>
  • Restore the data whose MD5 hash is specified in the .dvc file
$ dvc checkout <target>
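
The two steps above can be sketched as a single helper. The function name `restore_data_version` is hypothetical, and the sketch assumes the requested version of the data is already in the local cache; if it is not, `dvc pull <target>` would be needed to download it from the remote.

```shell
#!/bin/sh
# Hypothetical helper: switch a data file to the version recorded at a Git revision.
restore_data_version() {
    rev="$1"       # commit hash, tag, or branch
    target="$2"    # data file tracked by DVC, e.g. data.csv

    git checkout "$rev" || return 1   # restores the .dvc file from that revision
    dvc checkout "$target"            # restores the data matching the hash in the .dvc file
}

# Example (run inside a DVC repository; "v1.0" is an illustrative tag name):
#   restore_data_version v1.0 data.csv
```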

Tracking Data Changes

  • Change data file contents, then add dataset changes
$ dvc add <target>
  • Commit changed .dvc file to Git
$ git add <target>.dvc
$ git commit <target>.dvc \
    -m "Dataset updates"
  • Push metadata to the Git remote
$ git push origin main
  • Upload changed data file
$ dvc push
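
The full sequence above can be sketched as one helper that stops at the first failing step. The function name `version_dataset` is hypothetical; the remote `origin`, the branch `main`, and the commit message are taken from the commands above.

```shell
#!/bin/sh
# Hypothetical helper: record a changed data file and share it.
version_dataset() {
    target="$1"    # data file whose contents changed, e.g. data.csv

    dvc add "$target" || return 1                               # update <target>.dvc with the new hash
    git add "$target.dvc" || return 1                           # stage the changed .dvc file
    git commit -m "Dataset updates" "$target.dvc" || return 1   # commit the metadata
    git push origin main || return 1                            # share the metadata through Git
    dvc push                                                    # upload the changed data to the DVC remote
}

# Example (run inside a DVC repository after editing the file):
#   version_dataset data.csv
```

Pushing to Git and to the DVC remote together keeps the shared .dvc file and the data it points to from drifting apart.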

Let's practice!
