Congratulations!

Introduction to Data Versioning with DVC

Ravi Bhadauria

Machine Learning Engineer

Data versioning and DVC

  • Anatomy of Machine Learning Model
    • Code, data, and hyper-parameters precisely define a model
    • All three need to be tracked and versioned
  • Git and DVC
    • Git helps with tracking code, DVC helps with tracking data
    • Git tracks the .dvc metadata files that point to the actual data
  • DVC enables us to
    • Version data and models
    • Run reproducible experiment pipelines
    • Track changes in metrics and plots

DVC setup, cache, and remotes

  • Setup
    • Install using pip install dvc
    • Initialize using dvc init
    • Use .dvcignore to exclude file patterns from tracking
  • Cache
    • Add files using dvc add
    • Track metadata using .dvc files
    • Stop tracking using dvc remove, clean unused cache files with dvc gc
  • Remotes
    • Configure using dvc remote add, list using dvc remote list
    • Upload and download data using dvc push and dvc pull
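
The setup, cache, and remote steps above can be sketched as a few shell functions. This is a minimal sketch rather than the course's own script: the function names, the remote name, and all paths are hypothetical placeholders, and it assumes Git and DVC are already installed (for example via pip install dvc) and that the commands run inside an existing Git repository.

```shell
# Sketch of the setup -> cache -> remote workflow.
# All function names, file names, and remote names are hypothetical.

# Initialize DVC in the current Git repository and ignore temp files.
setup_dvc() {
    dvc init
    # Patterns listed in .dvcignore are excluded from DVC tracking.
    echo "*.tmp" >> .dvcignore
    git add .dvcignore
    git commit -m "Initialize DVC"
}

# Add a data file to the DVC cache and commit its metadata to Git.
track_file() {
    local data_file="$1"
    # Writes "$data_file.dvc" and a .gitignore entry next to the file.
    dvc add "$data_file"
    git add "$data_file.dvc" "$(dirname "$data_file")/.gitignore"
    git commit -m "Track $data_file with DVC"
}

# Register a default remote, list remotes, and upload cached data.
configure_remote_and_push() {
    local name="$1"
    local url="$2"
    dvc remote add -d "$name" "$url"
    dvc remote list
    git add .dvc/config
    git commit -m "Configure DVC remote"
    dvc push
}
```

A typical call sequence might be setup_dvc, then track_file data/raw.csv, then configure_remote_and_push myremote /tmp/dvc-storage. On another clone of the repository, dvc pull downloads the data that the committed .dvc files point to.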

DVC pipelines

  • Anatomy of the dvc.yaml file
    • Use dvc stage add to add stages
    • Components include stages, commands (cmd), dependencies (deps), params, and outputs (outs)
    • Track metrics and plots using the metrics and plots keys
  • Visualize and run DAG
    • Visualize using dvc dag
    • Run with dvc repro
  • Show and compare metrics and plots
    • Visualize using dvc plots show and dvc metrics show
    • Compare using dvc plots diff and dvc metrics diff
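
The pipeline steps above can be sketched in the same way. This is an illustrative sketch under stated assumptions: the stage name train, the script train.py, the parameter train.epochs (expected in params.yaml), and the files data/train.csv, model.pkl, metrics.json, and plots.csv are all hypothetical, as are the function names.

```shell
# Sketch of defining, running, and inspecting a one-stage pipeline.
# Stage name, script, parameter, and file names are hypothetical.

# Write a "train" stage to dvc.yaml.
add_train_stage() {
    # -n name, -d dependency, -p parameter, -o output,
    # -M metrics file kept out of the DVC cache,
    # --plots-no-cache plots file kept out of the DVC cache.
    dvc stage add -n train \
        -d train.py -d data/train.csv \
        -p train.epochs \
        -o model.pkl \
        -M metrics.json \
        --plots-no-cache plots.csv \
        python train.py
}

# Show the DAG, rerun stages whose inputs changed, then inspect results.
run_and_inspect() {
    dvc dag
    dvc repro
    dvc metrics show
    dvc plots show
}

# Compare metrics and plots in the workspace against the last commit.
compare_with_head() {
    dvc metrics diff
    dvc plots diff
}
```

After add_train_stage, the stage appears in dvc.yaml with cmd, deps, params, outs, metrics, and plots entries. Because dvc repro reruns only stages whose dependencies or parameters changed, calling run_and_inspect a second time without edits should skip the training command.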

Thank you!
