MLOps

Data Versioning for ML: DVC, lakeFS, and Delta Lake Compared

Petru Constantin
7 min read
#data versioning#DVC#lakeFS#Delta Lake#MLOps#data management

In machine learning, versioning code with Git is only one third of the reproducibility puzzle. Data and model artifacts change independently of the code, and without proper versioning you cannot reproduce past experiments, audit model decisions, or reliably compare model versions.

Why you need data versioning

Consider this scenario: your model's performance has dropped in production. To investigate, you need to answer:

  1. Which data was the current model trained on?
  2. How does that data differ from what the model sees now?
  3. Can I reproduce the training exactly to verify?

Without data versioning, the answer to all three is "I don't know." With versioning, every training run is pinned to a specific data version, and you can diff, roll back, and reproduce on demand.
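
For example, with DVC (covered below) a past run's exact inputs can be pulled back through its Python API. A minimal sketch, assuming a DVC-tracked Parquet file and a Git tag marking the run; the repository URL and paths are hypothetical:

import io

import dvc.api
import pandas as pd

# Fetch the dataset bytes exactly as tracked at Git tag v1.0
raw = dvc.api.read(
    "data/training/customers.parquet",
    repo="https://github.com/acme/ml-pipeline",  # hypothetical repository
    rev="v1.0",                                  # Git tag pinning code + data together
    mode="rb",
)
df = pd.read_parquet(io.BytesIO(raw))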

Tool comparison

| Feature | DVC | lakeFS | Delta Lake |
|---------|-----|--------|------------|
| Architecture | Git extension | Git-like server | Storage layer |
| Storage | S3, GCS, Azure, local | S3 (via gateway) | S3, ADLS, HDFS |
| Branching | Git branches | Native branches | Time travel |
| Merge | Git merge | Branch merges | N/A |
| File types | Any file | Any file | Parquet/tables |
| Size limit | None (cloud storage) | None | None |
| Query engine | None | Reads via S3 API | Spark, Presto, Trino |
| Setup | CLI + remote storage | Server deployment | Spark cluster |
| Best for | ML experiments | Data lake versioning | Analytics + ML |

DVC (Data Version Control)

DVC extends Git to track large files and datasets. It stores metadata in Git while the actual data lives in cloud storage.

Setup and basic usage

# Initialize DVC inside the Git repo
pip install "dvc[s3]"
dvc init
 
# Configure remote storage
dvc remote add -d myremote s3://ml-data-versioned/dvc-cache
dvc remote modify myremote region eu-central-1
 
# Track a dataset
dvc add data/training/customers.parquet
git add data/training/customers.parquet.dvc data/training/.gitignore
git commit -m "Add training data v1.0"
 
# Push the data to the remote
dvc push

DVC pipeline with parameters

# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw/
    params:
      - prepare.split_ratio
      - prepare.random_seed
    outs:
      - data/processed/train.parquet
      - data/processed/test.parquet
 
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/train.parquet
    params:
      - train.model_type
      - train.n_estimators
      - train.max_depth
    outs:
      - models/model.joblib
    metrics:
      - metrics/train.json:
          cache: false
 
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.joblib
      - data/processed/test.parquet
    metrics:
      - metrics/eval.json:
          cache: false
    plots:
      - plots/confusion_matrix.csv:
          cache: false
      - plots/roc_curve.csv:
          cache: false

# params.yaml
prepare:
  split_ratio: 0.2
  random_seed: 42
 
train:
  model_type: gradient_boosting
  n_estimators: 200
  max_depth: 8
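
At run time the stage scripts read these values themselves. A minimal sketch of how src/train.py might do it with dvc.api.params_show(); the surrounding training code is omitted:

import dvc.api

# Returns the contents of params.yaml as a nested dict when executed inside the repo
params = dvc.api.params_show()

model_type = params["train"]["model_type"]      # "gradient_boosting"
n_estimators = params["train"]["n_estimators"]  # 200
max_depth = params["train"]["max_depth"]        # 8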

Experiment tracking with DVC

# Run experiments with different parameters
dvc exp run -S train.n_estimators=100
dvc exp run -S train.n_estimators=200
dvc exp run -S train.n_estimators=300 -S train.max_depth=12
 
# Compare the experiments
dvc exp show
# ┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
# ┃ Experiment          ┃ accuracy   ┃ f1      ┃ n_estimators  ┃ max_depth ┃
# ┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
# │ main                │ 0.923      │ 0.891   │ 200           │ 8         │
# │ ├── exp-abc123     │ 0.918      │ 0.882   │ 100           │ 8         │
# │ ├── exp-def456     │ 0.923      │ 0.891   │ 200           │ 8         │
# │ └── exp-ghi789     │ 0.931      │ 0.905   │ 300           │ 12        │
 
# Apply the best experiment
dvc exp apply exp-ghi789
git add .
git commit -m "Apply best model: n_estimators=300, max_depth=12"
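
The same comparison can be pulled into Python for custom reports via dvc.api.exp_show(), which returns one dict per experiment. A sketch; the exact column names depend on your params and metrics files:

import dvc.api
import pandas as pd

# One row per experiment; metrics and params become columns
df = pd.DataFrame(dvc.api.exp_show())
# Sorting by "accuracy" assumes the metrics file shown above
print(df.sort_values("accuracy", ascending=False).head())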

Diffing data

# See what changed between commits
dvc diff HEAD~1
# Modified: data/training/customers.parquet
#   size: 45.2 MB -> 48.7 MB
#   hash: abc123 -> def456
 
# Compare metrics between Git tags
dvc metrics diff v1.0 v2.0
# Path              Metric    Old     New     Change
# metrics/eval.json accuracy  0.91    0.93    0.02
# metrics/eval.json f1        0.87    0.90    0.03
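
For a programmatic check (a CI quality gate, say), the same metrics file can be read at each tag with dvc.api.read. A sketch, assuming the layout above and execution inside the repo:

import json

import dvc.api

# metrics/eval.json as committed at each Git tag
old = json.loads(dvc.api.read("metrics/eval.json", rev="v1.0"))
new = json.loads(dvc.api.read("metrics/eval.json", rev="v2.0"))

for metric in ("accuracy", "f1"):
    print(f"{metric}: {old[metric]:.2f} -> {new[metric]:.2f}")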

lakeFS

lakeFS brings Git-style branching and merging to your entire data lake. It sits as a gateway between your applications and S3, exposing a versioned interface.

Setup and branching

# Install the lakeFS Python SDK (the lakectl CLI is installed separately)
pip install lakefs
 
# Create a repository
lakectl repo create lakefs://ml-data s3://my-data-bucket/lakefs
 
# Create a branch for an experiment
lakectl branch create lakefs://ml-data/experiment-v2 --source lakefs://ml-data/main

Python SDK usage

import lakefs
from lakefs.client import Client
 
# Configure the client
client = Client(
    host="http://lakefs.company.com:8000",
    username="access_key_id",
    password="secret_access_key",
)
 
repo = lakefs.Repository("ml-data", client=client)
 
# Create an experiment branch
experiment_branch = repo.branch("experiment-new-features").create(source_reference="main")
 
# Upload new training data to the branch
with open("data/new_features.parquet", "rb") as f:
    experiment_branch.object("training/features.parquet").upload(data=f.read())
 
# Commit the change
experiment_branch.commit(
    message="Add new customer features for the experiment",
    metadata={"experiment": "feature-expansion-v2", "author": "ml-team"},
)
 
# Once the experiment succeeds, merge it back into main
main_branch = repo.branch("main")
experiment_branch.merge_into(main_branch)

Reading data via the S3 API

import pandas as pd
 
# Read from the lakeFS branch (S3-compatible endpoint)
df = pd.read_parquet(
    "s3://ml-data/experiment-v2/training/features.parquet",
    storage_options={
        "endpoint_url": "http://lakefs.company.com:8000",
        "key": "access_key_id",
        "secret": "secret_access_key",
    },
)

Delta Lake

Delta Lake adds ACID transactions, time travel, and schema enforcement on top of Parquet files. It is the de facto standard for Spark-based ML pipelines.

Time travel for ML

from delta.tables import DeltaTable
from pyspark.sql import SparkSession
 
spark = SparkSession.builder \
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.1.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog") \
    .getOrCreate()
 
# Write versioned training data (training_data: a DataFrame prepared upstream)
training_data.write.format("delta").mode("overwrite").save("s3://ml-data/training/customers")
 
# Read the data at a specific version (for reproducibility)
df_v5 = spark.read.format("delta").option("versionAsOf", 5).load("s3://ml-data/training/customers")
 
# Read the data as of a specific timestamp
df_snapshot = spark.read.format("delta") \
    .option("timestampAsOf", "2026-02-01T00:00:00") \
    .load("s3://ml-data/training/customers")
 
# View the table history
delta_table = DeltaTable.forPath(spark, "s3://ml-data/training/customers")
delta_table.history().show()
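
To make a training run replayable, record the table version you read at training time. A short sketch reusing delta_table from above; the manifest shape is an assumption:

# The most recent history entry is the current table version
latest = delta_table.history(1).select("version").collect()[0]["version"]

# Store it with the run so the read can be replayed via versionAsOf
run_manifest = {"data_version": f"delta:s3://ml-data/training/customers@{latest}"}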

Schema enforcement

# Delta Lake enforces the schema on write, preventing silent pipeline corruption
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType
 
expected_schema = StructType([
    StructField("customer_id", IntegerType(), nullable=False),
    StructField("age", IntegerType(), nullable=False),
    StructField("lifetime_value", DoubleType(), nullable=True),
    StructField("churn", IntegerType(), nullable=False),
])
 
# new_data is a DataFrame produced upstream; this append fails if its schema does not match, protecting the ML pipeline
new_data.write.format("delta") \
    .option("mergeSchema", "false") \
    .mode("append") \
    .save("s3://ml-data/training/customers")
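
To see the enforcement trigger, try appending a batch whose schema has drifted. A minimal sketch; the sample row is made up:

from pyspark.sql.utils import AnalysisException

# Hypothetical drifted batch: lifetime_value arrives as a string
drifted = spark.createDataFrame(
    [(1001, 34, "2450.00", 0)],
    schema="customer_id INT, age INT, lifetime_value STRING, churn INT",
)

try:
    drifted.write.format("delta").mode("append").save("s3://ml-data/training/customers")
except AnalysisException as err:
    # Delta rejects the append because the column type diverges from the table schema
    print(f"Rejected as expected: {err}")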

Decision framework

Choose DVC if:

  • Your ML team already lives in Git
  • Experiments are your primary workflow (not data engineering)
  • You want a simple setup (no server to run)
  • You need to version heterogeneous files (models, configs, data)

Choose lakeFS if:

  • You manage a data lake with multiple consumers
  • You need branch-based experimentation on your data
  • Several teams access the same data
  • S3 API compatibility matters

Choose Delta Lake if:

  • You use Spark for data processing
  • You need ACID transactions on your data
  • Schema enforcement is critical
  • Time-travel queries are enough (you don't need branching)

Integrating with the MLOps stack

Data versioning plugs into every part of the ML lifecycle.

The three-version lock (revisited)

Every model in production should lock three versions — code, data, and features — plus enough metadata to trace the run:

{
  "model_version": "churn-v2.4",
  "code_version": "git:abc123",
  "data_version": "dvc:data/v2.4",
  "feature_version": "feast:customer_features/v3",
  "trained_at": "2026-02-10T03:00:00Z",
  "training_pipeline": "kubeflow:run-xyz789"
}
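
A sketch of assembling such a manifest at the end of a training run. The helper and the .dvc path are hypothetical, reading the dataset hash relies on DVC's documented .dvc YAML layout, and PyYAML is assumed installed:

import json
import subprocess
from datetime import datetime, timezone

import yaml  # PyYAML

def build_version_lock(model_version: str,
                       dvc_file: str = "data/training/customers.parquet.dvc") -> dict:
    # Current code version from Git
    code_rev = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()
    # A .dvc file records the content hash (md5) of the tracked dataset
    with open(dvc_file) as f:
        data_hash = yaml.safe_load(f)["outs"][0]["md5"]
    return {
        "model_version": model_version,
        "code_version": f"git:{code_rev}",
        "data_version": f"dvc:md5:{data_hash}",
        "trained_at": datetime.now(timezone.utc).isoformat(),
    }

print(json.dumps(build_version_lock("churn-v2.4"), indent=2))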

Without data versioning, you have a two-legged stool. Add it, and your ML system becomes fully reproducible.


Need help implementing data versioning for your ML team? DeviDevs builds reproducible MLOps infrastructure with DVC, lakeFS, or Delta Lake. Request a free assessment ->
