Interacting with DVC remotes

CI/CD for Machine Learning

Ravi Bhadauria

Machine Learning Engineer

Understanding DVC Remotes

  • DVC remotes: storage locations for DVC-tracked data
  • Similar to Git remotes, but for cached data rather than code
  • Benefits of using remotes
    • Synchronize large files and directories
    • Centralize or distribute data storage
    • Save local space
  • Supported storage types
    • Amazon S3, Google Cloud Storage (GCS), Google Drive
    • SSH, HTTP, local file systems

Setting Up Remotes

  • Add a remote with the dvc remote add command
  • For AWS
dvc remote add myAWSremote s3://mybucket
  • Customizations can be done with dvc remote modify
dvc remote modify myAWSremote connect_timeout 300
  • Resulting entry in .dvc/config
['remote "myAWSremote"']
    url = s3://mybucket
    connect_timeout = 300

Local and Default Remotes

  • Local remotes are used for rapid prototyping
  • Use system directories or Network Attached Storage
dvc remote add mylocalremote /tmp/dvc
  • To make a remote the default, pass the -d flag when adding it
dvc remote add -d mylocalremote /tmp/dvc
  • Default remote assigned in the core section of .dvc/config
[core]
remote = mylocalremote

Uploading and Retrieving Data

  • Commands to transfer data
    • Push to remote: dvc push <target>
    • Pull from remote: dvc pull <target>
  • Similar to git push and git pull
    • .dvc metadata files are tracked by Git; the data itself is transferred by DVC
  • Target can be an individual file: dvc push data.csv
    • Or the entire cache: dvc push
  • Override default remote with -r flag
dvc push -r myAWSremote data.csv

Tracking Data Changes

  • After changing the data file's contents, re-add it so DVC records the new version
dvc add /path/to/data/datafile
  • Commit the updated .dvc file to Git
git add /path/to/data/datafile.dvc
git commit /path/to/data/datafile.dvc -m "Dataset updates"
  • Push metadata to Git
git push origin main
  • Upload changed data file
dvc push

Let's practice!
