Data Versioning for ML: DVC, lakeFS, and Delta Lake Compared
In machine learning, code versioning with Git is only one-third of the reproducibility puzzle. Data and model artifacts change independently of code, and without proper versioning, you can't reproduce past experiments, audit model decisions, or reliably compare model versions.
Why You Need Data Versioning
Consider this scenario: your model's performance has dropped in production. To investigate, you need to answer three questions:
- What data was the current model trained on?
- How is that data different from what the model sees now?
- Can I reproduce the training exactly to verify?
Without data versioning, the honest answer to all three is "I don't know." With versioning, each training run is tied to a locked data version, and you can diff, roll back, and reproduce at will.
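In practice, a "locked data version" just means recording an immutable identifier for the exact data used, right next to the code revision, for every training run. A minimal, tool-agnostic sketch (the helper names and manifest fields are illustrative, not from any particular tool):

import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def file_digest(path: str) -> str:
    """Content hash of the dataset; any change in the data yields a new version ID."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def lock_run(data_path: str, manifest_path: str = "run_manifest.json") -> dict:
    """Write a manifest pinning the code commit and the data content hash for this run."""
    manifest = {
        "code_version": subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip(),
        "data_path": data_path,
        "data_sha256": file_digest(data_path),
        "locked_at": datetime.now(timezone.utc).isoformat(),
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest

# Example: lock the training data before fitting the model
# lock_run("data/training/customers.parquet")

The tools below do this bookkeeping for you at dataset and data-lake scale, and add transfer, branching, and diffing on top.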
Tool Comparison
| Feature | DVC | lakeFS | Delta Lake |
|---------|-----|--------|------------|
| Architecture | Git extension | Git-like server | Storage layer |
| Storage | S3, GCS, Azure, local | S3 (via gateway) | S3, ADLS, HDFS |
| Branching | Git branches | Native branches | Time travel |
| Merge | Git merge | Merge branches | N/A |
| File types | Any file | Any file | Parquet/tables |
| Size limit | None (cloud storage) | None | None |
| Query engine | None | Read via S3 API | Spark, Presto, Trino |
| Setup | CLI + remote storage | Server deployment | Spark cluster |
| Best for | ML experiments | Data lake versioning | Analytics + ML |
DVC (Data Version Control)
DVC extends Git to track large files and datasets. It stores metadata in Git while actual data lives in cloud storage.
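Because only the small .dvc pointer file is committed, code can later ask DVC for the data exactly as it was at any Git revision. A minimal sketch using DVC's Python API; the file path and the v1.0 tag are placeholders, and it assumes a DVC remote configured as in the setup below:

import io

import dvc.api
import pandas as pd

# Fetch the dataset as it existed at Git tag v1.0, pulling bytes from the DVC remote
raw = dvc.api.read("data/training/customers.parquet", rev="v1.0", mode="rb")
df_v1 = pd.read_parquet(io.BytesIO(raw))

# get_url() resolves the same version to its location in remote storage (no download)
print(dvc.api.get_url("data/training/customers.parquet", rev="v1.0"))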
Setup and Basic Usage
# Install DVC with S3 support, then initialize it inside your Git repo
pip install "dvc[s3]"
dvc init
# Configure remote storage
dvc remote add -d myremote s3://ml-data-versioned/dvc-cache
dvc remote modify myremote region eu-central-1
# Track a dataset
dvc add data/training/customers.parquet
git add data/training/customers.parquet.dvc data/training/.gitignore
git commit -m "Add training data v1.0"
# Push data to remote
dvc push

DVC Pipeline with Params
# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw/
    params:
      - prepare.split_ratio
      - prepare.random_seed
    outs:
      - data/processed/train.parquet
      - data/processed/test.parquet
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/train.parquet
    params:
      - train.model_type
      - train.n_estimators
      - train.max_depth
    outs:
      - models/model.joblib
    metrics:
      - metrics/train.json:
          cache: false
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.joblib
      - data/processed/test.parquet
    metrics:
      - metrics/eval.json:
          cache: false
    plots:
      - plots/confusion_matrix.csv:
          cache: false
      - plots/roc_curve.csv:
          cache: false

# params.yaml
prepare:
  split_ratio: 0.2
  random_seed: 42

train:
  model_type: gradient_boosting
  n_estimators: 200
  max_depth: 8

Experiment Tracking with DVC
# Run experiments with different parameters
dvc exp run -S train.n_estimators=100
dvc exp run -S train.n_estimators=200
dvc exp run -S train.n_estimators=300 -S train.max_depth=12
# Compare experiments
dvc exp show
# ┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
# ┃ Experiment     ┃ accuracy ┃ f1    ┃ n_estimators ┃ max_depth ┃
# ┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
# │ main           │ 0.923    │ 0.891 │ 200          │ 8         │
# │ ├── exp-abc123 │ 0.918    │ 0.882 │ 100          │ 8         │
# │ ├── exp-def456 │ 0.923    │ 0.891 │ 200          │ 8         │
# │ └── exp-ghi789 │ 0.931    │ 0.905 │ 300          │ 12        │
# Apply best experiment
dvc exp apply exp-ghi789
git add .
git commit -m "Apply best model: n_estimators=300, max_depth=12"Data Diffing
# See what changed between commits
dvc diff HEAD~1
# Modified: data/training/customers.parquet
# size: 45.2 MB -> 48.7 MB
# hash: abc123 -> def456
# Compare metrics across Git tags
dvc metrics diff v1.0 v2.0
# Path               Metric    Old   New   Change
# metrics/eval.json  accuracy  0.91  0.93  0.02
# metrics/eval.json  f1        0.87  0.90  0.03

lakeFS
lakeFS provides Git-like branching and merging for your entire data lake. It sits as a gateway between your applications and S3, presenting a versioned interface.
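Because the gateway speaks the S3 protocol, any S3 client can read versioned data by pointing its endpoint at lakeFS: the repository acts as the bucket, and the first path element is a ref, either a branch name or an immutable commit ID. A minimal sketch with boto3, reusing the placeholder endpoint and credentials from the examples below:

import boto3

# A standard S3 client, but the endpoint is the lakeFS S3 gateway
s3 = boto3.client(
    "s3",
    endpoint_url="http://lakefs.company.com:8000",
    aws_access_key_id="access_key_id",
    aws_secret_access_key="secret_access_key",
)

# Bucket = repository, key = <ref>/<path>; a commit ID as the ref makes the read immutable
obj = s3.get_object(Bucket="ml-data", Key="main/training/features.parquet")
features_bytes = obj["Body"].read()

Recording a commit ID rather than a branch name is what keeps a training run reproducible after main moves on.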
Setup and Branching
# Install the lakeFS Python SDK (the high-level `lakefs` package, which wraps lakefs-sdk)
pip install lakefs
# Create a repository
lakectl repo create lakefs://ml-data s3://my-data-bucket/lakefs
# Create a branch for an experiment
lakectl branch create lakefs://ml-data/experiment-v2 --source lakefs://ml-data/main

Python SDK Usage
import lakefs
from lakefs.client import Client

# Configure the client (endpoint and credentials for your lakeFS server)
client = Client(
    host="http://lakefs.company.com:8000",
    username="access_key_id",
    password="secret_access_key",
)
repo = lakefs.Repository("ml-data", client=client)

# Create an experiment branch from main
experiment_branch = repo.branch("experiment-new-features").create(source_reference="main")

# Upload new training data to the branch
with open("data/new_features.parquet", "rb") as f:
    experiment_branch.object("training/features.parquet").upload(data=f.read())

# Commit the change
experiment_branch.commit(
    message="Add new customer features for experiment",
    metadata={"experiment": "feature-expansion-v2", "author": "ml-team"},
)

# After the experiment succeeds, merge it back into main
main_branch = repo.branch("main")
experiment_branch.merge_into(main_branch, message="Merge successful experiment: feature-expansion-v2")

Reading Data via S3 API
import pandas as pd

# Read from lakeFS branch (S3-compatible)
df = pd.read_parquet(
    "s3://ml-data/experiment-v2/training/features.parquet",
    storage_options={
        "endpoint_url": "http://lakefs.company.com:8000",
        "key": "access_key_id",
        "secret": "secret_access_key",
    },
)

Delta Lake
Delta Lake adds ACID transactions, time travel, and schema enforcement to Parquet files. It's the standard for Spark-based ML pipelines.
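For reproducibility, the pattern that matters is capturing the table version you trained on and storing it with the run, then reloading exactly that snapshot later. A short sketch, assuming the `spark` session configured for Delta as in the next snippet and the same table path; the metadata dict is illustrative:

from delta.tables import DeltaTable

table_path = "s3://ml-data/training/customers"

# At training time: capture the latest committed version of the table
latest_version = DeltaTable.forPath(spark, table_path).history(1).select("version").first()["version"]

# Store it with the run so the exact snapshot can be reloaded later
run_metadata = {"data_path": table_path, "delta_version": latest_version}

# At reproduction time: read exactly the snapshot the model was trained on
df_trained_on = spark.read.format("delta") \
    .option("versionAsOf", run_metadata["delta_version"]) \
    .load(run_metadata["data_path"])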
Time Travel for ML
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Delta needs the package plus the SQL extension and catalog settings
spark = SparkSession.builder \
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.1.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Write versioned training data (training_data is an existing Spark DataFrame)
training_data.write.format("delta").mode("overwrite").save("s3://ml-data/training/customers")

# Read data at a specific version (for reproducibility)
df_v5 = spark.read.format("delta").option("versionAsOf", 5).load("s3://ml-data/training/customers")

# Read data at a specific timestamp
df_snapshot = spark.read.format("delta") \
    .option("timestampAsOf", "2026-02-01T00:00:00") \
    .load("s3://ml-data/training/customers")

# View the table's commit history
delta_table = DeltaTable.forPath(spark, "s3://ml-data/training/customers")
delta_table.history().show()

Schema Enforcement
# Delta Lake enforces the table schema on write, preventing silent pipeline corruption
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

expected_schema = StructType([
    StructField("customer_id", IntegerType(), nullable=False),
    StructField("age", IntegerType(), nullable=False),
    StructField("lifetime_value", DoubleType(), nullable=True),
    StructField("churn", IntegerType(), nullable=False),
])

# Build the incoming batch against the expected schema (new_rows is your source data)
new_data = spark.createDataFrame(new_rows, schema=expected_schema)

# The append fails if the DataFrame schema doesn't match the table's schema,
# protecting downstream training jobs from silently corrupted inputs
new_data.write.format("delta") \
    .option("mergeSchema", "false") \
    .mode("append") \
    .save("s3://ml-data/training/customers")

Decision Framework
Choose DVC if:
- Your ML team uses Git daily
- Experiments are the primary workflow (not data engineering)
- You want lightweight setup (no server)
- You need to version arbitrary files (models, configs, data)
Choose lakeFS if:
- You manage a data lake with multiple consumers
- You need branch-based data experimentation
- Multiple teams access the same data
- S3 API compatibility matters
Choose Delta Lake if:
- You use Spark for data processing
- You need ACID transactions on data
- Schema enforcement is critical
- Time travel queries are sufficient (no branching needed)
Integrating with the MLOps Stack
Data versioning connects to every part of the ML lifecycle:
- Experiment tracking — Link each MLflow run to a specific data version (see the sketch after this list)
- Feature stores — Version the feature computation code alongside data
- ML pipelines — Each pipeline run pins a data version
- Model monitoring — Compare production data against versioned training data
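For the experiment-tracking link, a simple pattern is to tag each run with the exact input versions it consumed, so any model in the registry can be traced back to its data. A minimal sketch with MLflow; the tag names and version strings are placeholders, and the data identifier can be a DVC Git revision, a lakeFS commit ID, or a Delta table version:

import mlflow

with mlflow.start_run(run_name="churn-training"):
    # Pin the exact inputs of this run as searchable tags
    mlflow.set_tags({
        "code_version": "git:abc123",
        "data_version": "dvc:data/v2.4",   # or a lakeFS commit ID / Delta table version
        "feature_version": "feast:customer_features/v3",
    })
    # ... train the model, then log metrics and artifacts as usual ...
    mlflow.log_metric("accuracy", 0.93)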
The Three-Way Version Lock (Revisited)
Every production model should lock at least three versions: the code, the data, and the model artifact itself. In practice, the lock manifest carries a bit more context:
{
  "model_version": "churn-v2.4",
  "code_version": "git:abc123",
  "data_version": "dvc:data/v2.4",
  "feature_version": "feast:customer_features/v3",
  "trained_at": "2026-02-10T03:00:00Z",
  "training_pipeline": "kubeflow:run-xyz789"
}

Without data versioning, you have a two-legged stool. Add it, and your ML system becomes fully reproducible.
Need help implementing data versioning for your ML team? DeviDevs builds reproducible MLOps infrastructure with DVC, lakeFS, or Delta Lake. Get a free assessment →