Data Versioning Motivation

Introduction to Data Versioning with DVC

Ravi Bhadauria

Machine Learning Engineer

What is Data Versioning?

  • Definition
    • Monitors data changes over time
    • Snapshots data over iterations
    • Similar to code versioning
  • Benefits
    • Retrieval and inspection of earlier data versions
    • Data consistency, accountability, and lineage
  • Applications
    • Data Science and Machine Learning
    • Data Engineering
    • Financial Analysis, Auditing and Compliance
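
The idea of snapshotting data over iterations can be sketched in a few lines of Python. This is a minimal illustration, not how any particular tool is implemented: it assumes each dataset version is identified by a SHA-256 hash of its contents, so identical data always maps to the same version and any change produces a new one. The names `snapshot` and `history` are hypothetical.

```python
import hashlib


def snapshot(data: bytes, history: list) -> str:
    """Record a version ID for `data` and return it.

    The ID is a truncated SHA-256 digest of the contents (an assumption
    made for this sketch), so unchanged data never creates a new version.
    """
    version_id = hashlib.sha256(data).hexdigest()[:12]
    # Only append when the contents differ from the latest snapshot.
    if not history or history[-1] != version_id:
        history.append(version_id)
    return version_id


history = []
v1 = snapshot(b"id,label\n1,0\n2,1\n", history)
v1_again = snapshot(b"id,label\n1,0\n2,1\n", history)  # same contents
v2 = snapshot(b"id,label\n1,0\n2,1\n3,1\n", history)   # one row added

print(v1 == v1_again)  # identical data gives the identical version ID
print(v1 != v2)        # changed data gives a new version ID
print(len(history))    # number of distinct snapshots recorded
```

Keying versions on content rather than on file names or timestamps makes the history behave like a code commit log: a version ID changes only when the data itself changes.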

Data vs Code Versioning

Code Versioning

  • Well known in software development
  • Uses tools like Git for distributed version control
  • Easier to manage because codebases are relatively small

Data Versioning

  • Relatively new (SciDB proposed in 2012)
  • Toolchains like DVC are used in conjunction with Git
  • Relatively difficult to manage due to large dataset size
1 doi: 10.1109/ICDE.2012.102

Why Data Versioning in ML?

Image of constitution of ML model as Code, Hyperparameters, and Data


Dataset influence

Dataset A: [Image: dataframe head of dataset A]

Dataset B: [Image: dataframe head of dataset B]


Dataset influence

 

Hyperparameters are kept consistent, dataset changed

 

Metric      Dataset A   Dataset B
Precision   0.78        0.79
Recall      0.54        0.57
F1 Score    0.64        0.66
Accuracy    0.80        0.81
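
As a quick sanity check on the table, the F1 score can be recomputed from the precision and recall rows, since F1 is their harmonic mean. The sketch below uses only the values shown above; the helper name `f1_score` is chosen for illustration and is not taken from the course code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for each dataset.
f1_a = round(f1_score(0.78, 0.54), 2)
f1_b = round(f1_score(0.79, 0.57), 2)

print(f1_a)  # 0.64, matching the Dataset A column
print(f1_b)  # 0.66, matching the Dataset B column
```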

Hyperparameters influence

 

Dataset kept consistent, hyperparameters changed

 

Metric      n_estimators=5   n_estimators=10
Precision   0.78             0.85
Recall      0.54             0.52
F1 Score    0.64             0.65
Accuracy    0.80             0.81
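
The two comparisons above each change exactly one ingredient of the model. One way to make that explicit is to record every run's dataset and hyperparameters and then diff the records. The sketch below is a hypothetical illustration: the record layout and the function `changed_components` are assumptions, and it assumes (as the identical metric columns suggest) that the Dataset A run and the `n_estimators=5` run are the same baseline.

```python
def changed_components(run_a: dict, run_b: dict) -> list:
    """Return the sorted keys whose values differ between two run records."""
    keys = run_a.keys() | run_b.keys()
    return sorted(k for k in keys if run_a.get(k) != run_b.get(k))


# Hypothetical run records: which data and hyperparameters produced a model.
baseline = {"data": "Dataset A", "n_estimators": 5}
new_data = {"data": "Dataset B", "n_estimators": 5}
new_params = {"data": "Dataset A", "n_estimators": 10}

print(changed_components(baseline, new_data))    # ['data']
print(changed_components(baseline, new_params))  # ['n_estimators']
```

Without a recorded data version, the first comparison could not be told apart from a rerun of the baseline, which is the gap data versioning fills.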

Editor Exercises Layout

Annotated picture of editor exercise layout displaying folder, file, and terminal area.


Let's practice!
