DVC Setup and Initialization

Introduction to Data Versioning with DVC

Ravi Bhadauria

Machine Learning Engineer

Installation

  • DVC is a Python package
    • Install on any platform with pip
$ pip install dvc
  • Remember to install in virtual environments
  • Ensure Git is installed
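
The prerequisites above can be checked before installing. The following is a minimal sketch, assuming a POSIX shell on a Unix-like system; `check_tool` and `install_dvc` are hypothetical helper names, and `.venv` is an assumed name for the virtual environment directory.

```shell
# check_tool: hypothetical helper that reports whether a command is on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

# install_dvc: hypothetical helper that installs DVC into a fresh virtual
# environment. The directory name .venv is an assumption, not a DVC requirement.
install_dvc() {
    check_tool git || return 1
    check_tool python3 || return 1
    python3 -m venv .venv || return 1
    # Activate the environment so pip installs into it, not system-wide.
    . .venv/bin/activate
    pip install dvc
}
```

Calling `install_dvc` from the project directory runs both checks and then the install; nothing is installed if Git or Python is missing.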

Verify Installation

$ dvc version
DVC version: 3.40.1 (pip)

Platform: Python 3.9.16 on macOS-14.2.1-arm64-arm-64bit
Config:
    Global: /Users/<username>/Library/Application Support/dvc
    System: /Library/Application Support/dvc
Repo: dvc, git
1 https://dvc.org/doc/command-reference/version
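
When a script needs only the version number, the first line of this output can be parsed. This is a minimal sketch, assuming the first line keeps the `DVC version: X.Y.Z (source)` shape shown above; `parse_dvc_version` is a hypothetical helper name.

```shell
# parse_dvc_version: hypothetical helper. Reads `dvc version` output on stdin
# and prints only the version number from the "DVC version:" line.
parse_dvc_version() {
    sed -n 's/^DVC version: \([0-9][0-9.]*\).*/\1/p'
}
```

It is meant to be used as `dvc version | parse_dvc_version`.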

Initializing DVC

  • Ensure Git is initialized
$ git init
Initialized empty Git repository in /path/to/repo/.git/
  • Initialize DVC in the repository
$ dvc init
Initialized DVC repository.

You can now commit the changes to git.
1 https://dvc.org/doc/command-reference/init
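
The two initialization steps can be combined so that repeated runs are harmless. This is a minimal sketch, assuming that an existing `.git` or `.dvc` directory means the corresponding step is already done; `init_dvc_repo` is a hypothetical helper name.

```shell
# init_dvc_repo: hypothetical helper that initializes Git and then DVC in the
# given directory (default: current directory), skipping steps already done.
init_dvc_repo() {
    dir="${1:-.}"
    (
        cd "$dir" || exit 1
        # Git must be initialized before DVC.
        [ -d .git ] || git init || exit 1
        # dvc init creates the .dvc directory; skip it if that already exists.
        [ -d .dvc ] || dvc init || exit 1
    )
}
```

The subshell keeps the `cd` from changing the caller's working directory.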

DVC Hidden Files

  • Initialization creates internal files that should be tracked with Git
$ git status
Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
    new file:   .dvc/.gitignore
    new file:   .dvc/config
    new file:   .dvcignore
  • Commit the changes
$ git commit -m "initialized dvc"

.dvcignore File

  • Similar to a .gitignore file
    • Follows the same pattern syntax
    • Lists files/directories that DVC will ignore
  • Useful when the workspace holds many files that do not need tracking
    • Improves execution time of DVC operations
1 https://dvc.org/doc/user-guide/project-structure/dvcignore-files
2 https://git-scm.com/docs/gitignore

Example

# .dvcignore
# Ignore all files in the 'data' directory
data/*

# But don't ignore 'data/data.csv'
!data/data.csv

# Ignore all .tmp files
*.tmp
1 https://dvc.org/doc/user-guide/project-structure/dvcignore-files
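
The example file can be written from the shell with a here-document. This is a minimal sketch; `write_dvcignore` is a hypothetical helper name, and the patterns are the ones from the example.

```shell
# write_dvcignore: hypothetical helper that writes the example patterns to
# .dvcignore in the given directory (default: current directory).
# Note: this overwrites any existing .dvcignore in that directory.
write_dvcignore() {
    # The quoted 'EOF' stops the shell from expanding anything in the body.
    cat > "${1:-.}/.dvcignore" <<'EOF'
# .dvcignore
# Ignore all files in the 'data' directory
data/*

# But don't ignore 'data/data.csv'
!data/data.csv

# Ignore all .tmp files
*.tmp
EOF
}
```

The negation line (`!data/data.csv`) comes after `data/*` because, as in .gitignore, a later pattern overrides an earlier one.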

Checking Ignored Files

  • Use the dvc check-ignore command
$ dvc check-ignore data/file.txt
data/file.txt
  • Add the -d flag to see the matching pattern and where it is defined
$ dvc check-ignore -d data/file.txt
.dvcignore:3:data/*    data/file.txt
1 https://dvc.org/doc/command-reference/check-ignore
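
Several paths can be checked in one pass by looping over them. This is a minimal sketch; `list_ignored` is a hypothetical helper name, and it assumes that `dvc check-ignore` prints a path when a pattern matches it (as shown above) and prints nothing otherwise.

```shell
# list_ignored: hypothetical helper that reports, for each given path,
# whether DVC ignores it. Assumption: dvc check-ignore prints the path when
# a .dvcignore pattern matches and prints nothing when none does.
list_ignored() {
    for path in "$@"; do
        if [ -n "$(dvc check-ignore "$path" 2>/dev/null)" ]; then
            echo "ignored: $path"
        else
            echo "not ignored: $path"
        fi
    done
}
```

For example, `list_ignored data/file.txt data/data.csv` prints one line per path.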

Summary

  • Install DVC using pip install dvc
  • Verify the DVC version, platform, etc.
    • dvc version
  • Initialize DVC in the workspace
    • dvc init
    • Initialize Git first
  • .dvcignore files are used to specify excluded files
    • Similar to .gitignore, follows same syntax
    • Check if a specific file is excluded
      • dvc check-ignore <filename>

Let's practice!
