Introduction to DVC

Introduction to Data Versioning with DVC

Ravi Bhadauria

Machine Learning Engineer

Git as Version Control

  • Code version control system
  • Independent local development
    • Branch and merge
    • Version history management
  • Enables collaboration

Schematic of decentralized model of Git

Git as Version Control

  • CLI-based interaction
  • Runs in a terminal (shell)
  • Git tracks contents via a repository
    • Actual files/folders to be tracked
    • Git metadata (in .git folder)

Image of Git repository structure
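
The repository layout described above can be inspected directly. The following is a minimal sketch, assuming Git is installed; the temporary folder and the file `code.py` are illustrative names, not part of the original slides.

```shell
#!/usr/bin/env bash
# Sketch: a Git repository is the tracked files plus a hidden .git folder.
set -eu

repo="$(mktemp -d)"        # throwaway working folder (assumption)
cd "$repo"

git init -q                # creates the .git metadata folder
echo 'print("hello")' > code.py

ls -A                      # lists code.py and the hidden .git folder
ls .git                    # Git's own metadata: HEAD, config, objects, refs, ...
```

Deleting the `.git` folder would discard the version history while leaving the working files in place, which is one way to see that the two parts are separate.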

Data Version Control (DVC)

  • DVC: Data Version Control tool
    • Manages data and experiments
    • Git-like commands and workflow

Concept picture of versioning data with Git and DVC

  • Git tracks the small DVC metadata files (.dvc); DVC stores and versions the data itself

Git vs DVC CLI

Git

  • Initialize repository in working folder
$ git init
  • Adding files to repository (staging changes)
$ git add code.py
  • Commit changes (in version history)
$ git commit -m "adding first file"

DVC

  • Initialize DVC repository in working folder
$ dvc init
  • Adding data files to DVC
$ dvc add data/mydata.csv
  • Record changes to tracked data files
$ dvc commit
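
The two command sets above are typically used together in one working folder. The following is a rough sketch, assuming Git is installed and DVC is optionally available; the folder, file contents, commit messages, and the `commit` helper are illustrative assumptions rather than anything prescribed by the slides.

```shell
#!/usr/bin/env bash
# Sketch: Git versions code and small .dvc pointer files; DVC versions the data.
set -eu

repo="$(mktemp -d)"        # throwaway working folder (assumption)
cd "$repo"

# Hypothetical helper: commit with an inline identity so the sketch
# does not depend on any global Git configuration.
commit() {
    git -c user.name="Example" -c user.email="example@example.com" \
        commit -q -m "$1"
}

# Git: initialize, stage, commit
git init -q
echo 'print("hello")' > code.py
git add code.py
commit "adding first file"

# DVC: initialize, add data, record a later change
if command -v dvc >/dev/null 2>&1; then
    dvc init -q
    mkdir -p data
    printf 'id,value\n1,10\n' > data/mydata.csv
    dvc add -q data/mydata.csv      # writes data/mydata.csv.dvc and data/.gitignore

    # Git stores the pointer file, not the data itself.
    git add data/mydata.csv.dvc data/.gitignore
    commit "track mydata.csv with DVC"

    # Change the data, then record the new version with DVC and Git.
    printf '2,20\n' >> data/mydata.csv
    dvc commit -q -f data/mydata.csv.dvc
    git add data/mydata.csv.dvc
    commit "update mydata.csv"

    cat data/mydata.csv.dvc         # pointer file holding the content hash
else
    echo "dvc not found; skipping the DVC steps"
fi
```

The `-f` flag is used here so that `dvc commit` does not stop to ask for confirmation in a non-interactive script.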

Git vs DVC CLI

Git

  • Push code changes to remote server
$ git push
  • Pulling changes from remote
$ git pull
  • Cloning an existing repository from remote (GitHub)
$ git clone \
https://github.com/username/repository-name.git

DVC

  • Push data changes to remote data server
$ dvc push
  • Pull tracked data from remote storage into the workspace
$ dvc pull
  • Download a file or directory tracked by DVC
$ dvc get \
https://github.com/username/repo-name model.pkl
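
The push and pull commands above can be tried without any hosted service. The following is a hedged sketch in which a bare Git repository and a plain local directory stand in for the remote server and the remote data storage; all paths, the remote name `localstore`, and the file names are assumptions made for illustration. `dvc get` is left out because it needs an existing DVC repository to download from.

```shell
#!/usr/bin/env bash
# Sketch: push code with Git and data with DVC, then pull both in a fresh clone.
set -eu

base="$(mktemp -d)"
git_remote="$base/git-remote.git"   # stand-in for a hosted Git server
dvc_remote="$base/dvc-remote"       # stand-in for remote data storage
mkdir -p "$dvc_remote"
git init -q --bare "$git_remote"

# Project with one code file, pushed to the Git remote.
git init -q "$base/project"
cd "$base/project"
git config user.name "Example"
git config user.email "example@example.com"
git remote add origin "$git_remote"
echo 'print("hello")' > code.py
git add code.py
git commit -q -m "adding first file"
git push -q origin HEAD

if command -v dvc >/dev/null 2>&1; then
    # Track a data file and point DVC at the local stand-in remote.
    dvc init -q
    printf 'id,value\n1,10\n' > mydata.csv
    dvc add -q mydata.csv
    dvc remote add -d localstore "$dvc_remote"
    git add mydata.csv.dvc .gitignore .dvc/config
    git commit -q -m "track data with DVC"

    git push -q origin HEAD         # code and .dvc metadata go to the Git remote
    dvc push -q                     # data goes to the DVC remote

    # A collaborator clones the code, then pulls the data.
    git clone -q "$git_remote" "$base/clone"
    cd "$base/clone"
    dvc pull -q
    cat mydata.csv
else
    echo "dvc not found; skipping the DVC steps"
fi
```

The clone initially contains only `mydata.csv.dvc`; the data file itself appears after `dvc pull`, which is the split between Git and DVC that the slides describe.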

Let's practice!
