Versioning datasets with Data Version Control

CI/CD for Machine Learning

Ravi Bhadauria

Machine Learning Engineer

Why versioning data matters

  • Versioning ensures a historical record of data changes
  • Data versioning is crucial for:
    • Reproducibility
    • Experimentation and Iteration
    • Collaboration
    • Bug Tracking and Debugging
    • Model Monitoring and Maintenance
    • Auditing and Validation

Data Version Control (DVC)

  • DVC: Data Version Control tool
    • Manages data and experiments
    • Similar to Git

[Figure: concept picture of versioning data with Git and DVC]

  • Git tracks the small metadata files; DVC stores and versions the data itself

DVC Storage

  • Data is stored separately from the Git repository
    • SSH, HTTP/HTTPS, Local File System
    • AWS, GCP, and Azure object storage
  • Install locally
pip install dvc

Initializing DVC

  • Initialize Git
-> git init
  • Initialize DVC
-> dvc init
Initialized DVC repository.
You can now commit the changes to git.
  • Sets up DVC project files, ready to version data
.dvc
|- .gitignore
|- config
|- tmp

Adding Files to DVC

  • Add data files using dvc add <file> command
-> dvc add data.csv
  • .dvc placeholders containing file metadata are generated
    • Each DVC tracked file has its corresponding .dvc file (data.csv -> data.csv.dvc)
    • The .dvc file is checked into Git to manage data versions
  • The file contents are stored in the DVC cache under .dvc/cache
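
The idea behind a hash-addressed cache can be sketched in a few lines of Python. This is only an illustration of the concept, not DVC's implementation: the helper names (file_md5, add_to_cache) and the two-character subdirectory layout are assumptions made for the example, and the real layout under .dvc/cache may differ between DVC versions.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the hex MD5 digest of a file's bytes, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_to_cache(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into cache_dir under a path derived from its hash."""
    checksum = file_md5(data_file)
    # Assumed layout for illustration: the first two hex characters
    # name a subdirectory, the remaining characters name the file.
    target = cache_dir / checksum[:2] / checksum[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        shutil.copyfile(data_file, target)
    return target


# Demo in a throwaway directory.
workdir = Path(tempfile.mkdtemp())
data_file = workdir / "data.csv"
data_file.write_bytes(b"This is a sample data file.\n")
cached = add_to_cache(data_file, workdir / "cache")

# A second file with the same bytes maps to the same cache entry.
copy_file = workdir / "copy.csv"
copy_file.write_bytes(data_file.read_bytes())
cached_again = add_to_cache(copy_file, workdir / "cache")
print(cached == cached_again)
```

Because the cache path depends only on the file's bytes, two files with identical content share one cache entry, and any change to the data produces a new entry instead of overwriting the old one.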

DVC data files

  • Data file
-> cat data.csv
This is a sample data file.
  • data.csv.dvc file
-> cat data.csv.dvc
outs:
- md5: ea9972ac9f8fa321ea74969e93acc196
  size: 28
  hash: md5
  path: data.csv
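
The md5 and size fields are enough to tell whether a data file still matches the version recorded in its .dvc file. The sketch below shows that comparison in plain Python. It is a hedged illustration rather than how DVC itself performs the check: describe and matches are hypothetical helpers, and the recorded entry is built in the script instead of being parsed from a real .dvc file.

```python
import hashlib
import tempfile
from pathlib import Path


def describe(path: Path) -> dict:
    """Build a metadata entry with the same fields as the .dvc example."""
    data = path.read_bytes()
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "size": len(data),
        "hash": "md5",
        "path": path.name,
    }


def matches(recorded: dict, path: Path) -> bool:
    """Return True if the file's current hash and size equal the recorded ones."""
    current = describe(path)
    return current["md5"] == recorded["md5"] and current["size"] == recorded["size"]


# Demo in a throwaway directory.
workdir = Path(tempfile.mkdtemp())
data_file = workdir / "data.csv"
data_file.write_bytes(b"This is a sample data file.\n")

recorded = describe(data_file)        # stands in for the .dvc entry
unchanged = matches(recorded, data_file)

data_file.write_bytes(b"This is a modified data file.\n")
changed = not matches(recorded, data_file)

print(recorded["size"], unchanged, changed)
```

Committing the small metadata entry to Git is what ties a Git revision to one exact version of the data, without storing the data itself in Git.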

Summary

  • Data versioning ensures reproducibility and collaboration
  • DVC: tool for managing data versioning, works with Git
  • Initialize DVC with dvc init, add files with dvc add <file>
  • .dvc files store metadata; Git tracks their versions

Let's practice!
