Data Versioning for ML: DVC, lakeFS, and Delta Lake Compared
In machine learning, code versioning with Git is only one-third of the reproducibility puzzle. Data and model artifacts change independently of code, and without proper versioning, you can't reproduce past experiments, audit model decisions, or reliably compare model versions.
Why You Need Data Versioning
Consider this scenario: your model's performance has dropped in production. To investigate, you need to answer three questions:
- What data was the current model trained on?
- How is that data different from what the model sees now?
- Can I reproduce the training exactly to verify?
Without data versioning, the honest answer to all three is "I don't know." With versioning, each training run is tied to a locked data version, and you can diff, roll back, and reproduce at will.
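In practice, a "locked data version" just means recording an immutable identifier for the exact data used, right next to the code revision, for every training run. A minimal, tool-agnostic sketch (the helper names and manifest fields are illustrative, not from any particular tool):

import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def file_digest(path: str) -> str:
    """Content hash of the dataset; any change in the data yields a new version ID."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def lock_run(data_path: str, manifest_path: str = "run_manifest.json") -> dict:
    """Write a manifest pinning the code commit and the data content hash for this run."""
    manifest = {
        "code_version": subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip(),
        "data_path": data_path,
        "data_sha256": file_digest(data_path),
        "locked_at": datetime.now(timezone.utc).isoformat(),
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest

# Example: lock the training data before fitting the model
# lock_run("data/training/customers.parquet")

The tools below do this bookkeeping for you at dataset and data-lake scale, and add transfer, branching, and diffing on top.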
Tool Comparison
| Feature | DVC | lakeFS | Delta Lake |
|---------|-----|--------|------------|
| Architecture | Git extension | Git-like server | Storage layer |
| Storage | S3, GCS, Azure, local | S3 (via gateway) | S3, ADLS, HDFS |
| Branching | Git branches | Native branches | Time travel |
| Merge | Git merge | Merge branches | N/A |
| File types | Any file | Any file | Parquet/tables |
| Size limit | None (cloud storage) | None | None |
| Query engine | None | Read via S3 API | Spark, Presto, Trino |
| Setup | CLI + remote storage | Server deployment | Spark cluster |
| Best for | ML experiments | Data lake versioning | Analytics + ML |
DVC (Data Version Control)
DVC extends Git to track large files and datasets. It stores metadata in Git while actual data lives in cloud storage.
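Because only the small .dvc pointer file is committed, code can later ask DVC for the data exactly as it was at any Git revision. A minimal sketch using DVC's Python API; the file path and the v1.0 tag are placeholders, and it assumes a DVC remote configured as in the setup below:

import io

import dvc.api
import pandas as pd

# Fetch the dataset as it existed at Git tag v1.0, pulling bytes from the DVC remote
raw = dvc.api.read("data/training/customers.parquet", rev="v1.0", mode="rb")
df_v1 = pd.read_parquet(io.BytesIO(raw))

# get_url() resolves the same version to its location in remote storage (no download)
print(dvc.api.get_url("data/training/customers.parquet", rev="v1.0"))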
Setup and Basic Usage
# Install DVC with S3 support, then initialize it inside your Git repo
pip install "dvc[s3]"
dvc init
# Configure remote storage
dvc remote add -d myremote s3://ml-data-versioned/dvc-cache
dvc remote modify myremote region eu-central-1
# Track a dataset
dvc add data/training/customers.parquet
git add data/training/customers.parquet.dvc data/training/.gitignore
git commit -m "Add training data v1.0"
# Push data to remote
dvc push

DVC Pipeline with Params
# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw/
    params:
      - prepare.split_ratio
      - prepare.random_seed
    outs:
      - data/processed/train.parquet
      - data/processed/test.parquet
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/train.parquet
    params:
      - train.model_type
      - train.n_estimators
      - train.max_depth
    outs:
      - models/model.joblib
    metrics:
      - metrics/train.json:
          cache: false
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.joblib
      - data/processed/test.parquet
    metrics:
      - metrics/eval.json:
          cache: false
    plots:
      - plots/confusion_matrix.csv:
          cache: false
      - plots/roc_curve.csv:
          cache: false

# params.yaml
prepare:
  split_ratio: 0.2
  random_seed: 42

train:
  model_type: gradient_boosting
  n_estimators: 200
  max_depth: 8

Experiment Tracking with DVC
# Run experiments with different parameters
dvc exp run -S train.n_estimators=100
dvc exp run -S train.n_estimators=200
dvc exp run -S train.n_estimators=300 -S train.max_depth=12
# Compare experiments
dvc exp show
# ┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
# ┃ Experiment     ┃ accuracy ┃ f1    ┃ n_estimators ┃ max_depth ┃
# ┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
# │ main           │ 0.923    │ 0.891 │ 200          │ 8         │
# │ ├── exp-abc123 │ 0.918    │ 0.882 │ 100          │ 8         │
# │ ├── exp-def456 │ 0.923    │ 0.891 │ 200          │ 8         │
# │ └── exp-ghi789 │ 0.931    │ 0.905 │ 300          │ 12        │
# Apply best experiment
dvc exp apply exp-ghi789
git add .
git commit -m "Apply best model: n_estimators=300, max_depth=12"Data Diffing
# See what changed between commits
dvc diff HEAD~1
# Modified: data/training/customers.parquet
# size: 45.2 MB -> 48.7 MB
# hash: abc123 -> def456
# Compare metrics across Git tags
dvc metrics diff v1.0 v2.0
# Path               Metric    Old   New   Change
# metrics/eval.json  accuracy  0.91  0.93  0.02
# metrics/eval.json  f1        0.87  0.90  0.03

lakeFS
lakeFS provides Git-like branching and merging for your entire data lake. It sits as a gateway between your applications and S3, presenting a versioned interface.
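Because the gateway speaks the S3 protocol, any S3 client can read versioned data by pointing its endpoint at lakeFS: the repository acts as the bucket, and the first path element is a ref, either a branch name or an immutable commit ID. A minimal sketch with boto3, reusing the placeholder endpoint and credentials from the examples below:

import boto3

# A standard S3 client, but the endpoint is the lakeFS S3 gateway
s3 = boto3.client(
    "s3",
    endpoint_url="http://lakefs.company.com:8000",
    aws_access_key_id="access_key_id",
    aws_secret_access_key="secret_access_key",
)

# Bucket = repository, key = <ref>/<path>; a commit ID as the ref makes the read immutable
obj = s3.get_object(Bucket="ml-data", Key="main/training/features.parquet")
features_bytes = obj["Body"].read()

Recording a commit ID rather than a branch name is what keeps a training run reproducible after main moves on.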
Setup and Branching
# Install the lakeFS Python SDK (the high-level `lakefs` package, which wraps lakefs-sdk)
pip install lakefs
# Create a repository
lakectl repo create lakefs://ml-data s3://my-data-bucket/lakefs
# Create a branch for an experiment
lakectl branch create lakefs://ml-data/experiment-v2 --source lakefs://ml-data/main

Python SDK Usage
import lakefs
from lakefs.client import Client

# Configure the client (endpoint and credentials for your lakeFS server)
client = Client(
    host="http://lakefs.company.com:8000",
    username="access_key_id",
    password="secret_access_key",
)
repo = lakefs.Repository("ml-data", client=client)

# Create an experiment branch from main
experiment_branch = repo.branch("experiment-new-features").create(source_reference="main")

# Upload new training data to the branch
with open("data/new_features.parquet", "rb") as f:
    experiment_branch.object("training/features.parquet").upload(data=f.read())

# Commit the change
experiment_branch.commit(
    message="Add new customer features for experiment",
    metadata={"experiment": "feature-expansion-v2", "author": "ml-team"},
)

# After the experiment succeeds, merge it back into main
main_branch = repo.branch("main")
experiment_branch.merge_into(main_branch, message="Merge successful experiment: feature-expansion-v2")

Reading Data via S3 API
import pandas as pd

# Read from lakeFS branch (S3-compatible)
df = pd.read_parquet(
    "s3://ml-data/experiment-v2/training/features.parquet",
    storage_options={
        "endpoint_url": "http://lakefs.company.com:8000",
        "key": "access_key_id",
        "secret": "secret_access_key",
    },
)

Delta Lake
Delta Lake adds ACID transactions, time travel, and schema enforcement to Parquet files. It's the standard for Spark-based ML pipelines.
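For reproducibility, the pattern that matters is capturing the table version you trained on and storing it with the run, then reloading exactly that snapshot later. A short sketch, assuming the `spark` session configured for Delta as in the next snippet and the same table path; the metadata dict is illustrative:

from delta.tables import DeltaTable

table_path = "s3://ml-data/training/customers"

# At training time: capture the latest committed version of the table
latest_version = DeltaTable.forPath(spark, table_path).history(1).select("version").first()["version"]

# Store it with the run so the exact snapshot can be reloaded later
run_metadata = {"data_path": table_path, "delta_version": latest_version}

# At reproduction time: read exactly the snapshot the model was trained on
df_trained_on = spark.read.format("delta") \
    .option("versionAsOf", run_metadata["delta_version"]) \
    .load(run_metadata["data_path"])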
Time Travel for ML
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Delta needs the package plus the SQL extension and catalog settings
spark = SparkSession.builder \
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.1.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Write versioned training data (training_data is an existing Spark DataFrame)
training_data.write.format("delta").mode("overwrite").save("s3://ml-data/training/customers")

# Read data at a specific version (for reproducibility)
df_v5 = spark.read.format("delta").option("versionAsOf", 5).load("s3://ml-data/training/customers")

# Read data at a specific timestamp
df_snapshot = spark.read.format("delta") \
    .option("timestampAsOf", "2026-02-01T00:00:00") \
    .load("s3://ml-data/training/customers")

# View the table's commit history
delta_table = DeltaTable.forPath(spark, "s3://ml-data/training/customers")
delta_table.history().show()

Schema Enforcement
# Delta Lake enforces the table schema on write, preventing silent pipeline corruption
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

expected_schema = StructType([
    StructField("customer_id", IntegerType(), nullable=False),
    StructField("age", IntegerType(), nullable=False),
    StructField("lifetime_value", DoubleType(), nullable=True),
    StructField("churn", IntegerType(), nullable=False),
])

# Build the incoming batch against the expected schema (new_rows is your source data)
new_data = spark.createDataFrame(new_rows, schema=expected_schema)

# The append fails if the DataFrame schema doesn't match the table's schema,
# protecting downstream training jobs from silently corrupted inputs
new_data.write.format("delta") \
    .option("mergeSchema", "false") \
    .mode("append") \
    .save("s3://ml-data/training/customers")

Decision Framework
Choose DVC if:
- Your ML team uses Git daily
- Experiments are the primary workflow (not data engineering)
- You want lightweight setup (no server)
- You need to version arbitrary files (models, configs, data)
Choose lakeFS if:
- You manage a data lake with multiple consumers
- You need branch-based data experimentation
- Multiple teams access the same data
- S3 API compatibility matters
Choose Delta Lake if:
- You use Spark for data processing
- You need ACID transactions on data
- Schema enforcement is critical
- Time travel queries are sufficient (no branching needed)
Integrating with the MLOps Stack
Data versioning connects to every part of the ML lifecycle:
- Experiment tracking — Link each MLflow run to a specific data version (see the sketch after this list)
- Feature stores — Version the feature computation code alongside data
- ML pipelines — Each pipeline run pins a data version
- Model monitoring — Compare production data against versioned training data
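For the experiment-tracking link, a simple pattern is to tag each run with the exact input versions it consumed, so any model in the registry can be traced back to its data. A minimal sketch with MLflow; the tag names and version strings are placeholders, and the data identifier can be a DVC Git revision, a lakeFS commit ID, or a Delta table version:

import mlflow

with mlflow.start_run(run_name="churn-training"):
    # Pin the exact inputs of this run as searchable tags
    mlflow.set_tags({
        "code_version": "git:abc123",
        "data_version": "dvc:data/v2.4",   # or a lakeFS commit ID / Delta table version
        "feature_version": "feast:customer_features/v3",
    })
    # ... train the model, then log metrics and artifacts as usual ...
    mlflow.log_metric("accuracy", 0.93)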
The Three-Way Version Lock (Revisited)
Every production model should lock at least three versions: the code, the data, and the model artifact itself. In practice, the lock manifest carries a bit more context:
{
  "model_version": "churn-v2.4",
  "code_version": "git:abc123",
  "data_version": "dvc:data/v2.4",
  "feature_version": "feast:customer_features/v3",
  "trained_at": "2026-02-10T03:00:00Z",
  "training_pipeline": "kubeflow:run-xyz789"
}

Without data versioning, you have a two-legged stool. Add it, and your ML system becomes fully reproducible.
Need help implementing data versioning for your ML team? DeviDevs builds reproducible MLOps infrastructure with DVC, lakeFS, or Delta Lake. Get a free assessment →