ML CI/CD: Continuous Integration and Deployment for Machine Learning
CI/CD for machine learning is fundamentally different from traditional software CI/CD. In software, you test code. In ML, you test code, data, model quality, and the serving infrastructure. This guide covers how to build ML CI/CD pipelines reliable enough for production.
Why ML CI/CD Is Different
| Aspect | Software CI/CD | ML CI/CD |
|--------|----------------|----------|
| What changes | Code | Code + data + model + configuration |
| Test types | Unit, integration, e2e | Data quality, model quality, integration, performance |
| Build artifact | Container/binary | Model artifact + serving configuration |
| Deployment trigger | Code push | Code push OR data refresh OR performance degradation |
| Rollback | Previous code version | Previous model version (may require different features) |
| Environment | Standard compute | GPU clusters for training, CPU/GPU for serving |
Arhitectura Pipeline ML CI/CD
Code Push / Data Refresh / Schedule
         │
         ▼
┌─────────────────┐
│ Data Validation │◄─── Schema checks, statistical tests, freshness
└────────┬────────┘
         │ Pass
         ▼
┌─────────────────┐
│ Feature Compute │◄─── Feature engineering, transformation
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Training     │◄─── Hyperparameter config, compute allocation
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Model Testing  │◄─── Quality gates, regression checks, bias tests
└────────┬────────┘
         │ Pass
         ▼
┌─────────────────┐
│    Registry     │◄─── Versioning, tagging, storage in the model registry
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Deploy      │◄─── Shadow → Canary → Production
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Monitor     │◄─── Drift, performance, latency
└─────────────────┘
GitHub Actions for ML CI/CD
Data Validation Pipeline
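The workflow below calls several validation scripts (`pipelines/validate_schema.py` and friends). As a hedged sketch of what such a schema gate might contain, here is one possibility using pandas; the column names and dtype contract are hypothetical, not taken from the article:

```python
"""Sketch of a schema-validation gate like pipelines/validate_schema.py
in the workflow below. The expected schema is an assumed example."""
import pandas as pd

# Assumed contract for the processed dataset (hypothetical columns)
EXPECTED_SCHEMA = {
    "customer_id": "int64",
    "tenure_months": "int64",
    "monthly_charges": "float64",
    "target": "int64",
}


def validate_schema(df: pd.DataFrame) -> list[str]:
    """Return a list of violations; an empty list means the gate passes."""
    errors = []
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            errors.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    if df.empty:
        errors.append("dataset is empty")
    return errors


# Example run on an in-memory frame (in CI this would be the pulled dataset,
# and a non-empty error list would translate to a non-zero exit code):
sample = pd.DataFrame({
    "customer_id": [1, 2],
    "tenure_months": [12, 3],
    "monthly_charges": [29.5, 70.1],
    "target": [0, 1],
})
print(validate_schema(sample))
```

In the CI step, the script would exit non-zero when the error list is non-empty, which is what makes the workflow job below fail and block the rest of the pipeline.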
name: Data Validation
on:
  schedule:
    - cron: '0 1 * * *'  # Daily at 1 AM UTC
  workflow_dispatch:
    inputs:
      data_version:
        description: 'Data version to validate'
        required: false
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
      - name: Install dependencies
        run: pip install -r requirements/validation.txt
      - name: Pull latest data
        run: dvc pull data/processed/
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - name: Run schema validation
        run: python pipelines/validate_schema.py
      - name: Run statistical tests
        run: python pipelines/validate_statistics.py
      - name: Run data quality checks
        run: python pipelines/validate_quality.py
      - name: Generate data profile
        run: python pipelines/generate_profile.py --output reports/data_profile.html
      - name: Upload validation report
        uses: actions/upload-artifact@v4
        with:
          name: data-validation-report
          path: reports/

Model Training and Testing Pipeline
name: ML Training Pipeline
on:
  push:
    paths:
      - 'src/models/**'
      - 'src/features/**'
      - 'configs/training/**'
  workflow_dispatch:
    inputs:
      experiment_name:
        description: 'MLflow experiment name'
        required: true
        default: 'production-training'
jobs:
  train:
    runs-on: [self-hosted, gpu]
    timeout-minutes: 120
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements/training.txt
      - name: Pull training data
        run: dvc pull data/
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - name: Train model
        run: |
          python src/train.py \
            --config configs/training/production.yaml \
            --experiment ${{ inputs.experiment_name || 'production-training' }}
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
      - name: Run model tests
        run: pytest tests/model/ -v --tb=short
      - name: Run bias and fairness tests
        run: python tests/fairness/check_bias.py
      - name: Upload model artifact
        uses: actions/upload-artifact@v4
        with:
          name: trained-model
          path: artifacts/model/
  quality-gate:
    needs: train
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download model artifact
        uses: actions/download-artifact@v4
        with:
          name: trained-model
          path: artifacts/model/
      - name: Compare with production model
        run: |
          python pipelines/compare_models.py \
            --new-model artifacts/model/ \
            --production-model models:/churn-predictor/Production \
            --threshold 0.02
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
      - name: Register model if improved
        if: success()
        run: |
          python pipelines/register_model.py \
            --model-path artifacts/model/ \
            --name churn-predictor \
            --stage staging
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
  deploy:
    needs: quality-gate
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Deploy canary (10% traffic)
        run: |
          python pipelines/deploy.py \
            --model-name churn-predictor \
            --stage staging \
            --strategy canary \
            --traffic-split 10
        env:
          K8S_CLUSTER: ${{ secrets.K8S_CLUSTER }}
      - name: Wait and monitor canary
        run: python pipelines/monitor_canary.py --duration 30m --model churn-predictor
      - name: Promote to production
        run: |
          python pipelines/deploy.py \
            --model-name churn-predictor \
            --strategy promote

Model Testing Framework
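The quality-gate job above invokes `pipelines/compare_models.py` with a `--threshold` flag. As a minimal sketch of what such a comparison gate could look like (metric values are stubbed here, and "threshold = required minimum improvement" is one plausible reading of that flag; a real script would score both models on the same held-out set via the MLflow registry):

```python
"""Sketch of a model-comparison quality gate, in the spirit of the
compare_models.py step above. Metrics are stubbed; a real version
would evaluate both models on shared held-out data."""


def passes_gate(new_metric: float, prod_metric: float, threshold: float = 0.02) -> bool:
    """Candidate must improve on production by at least `threshold`
    (assumed interpretation of the --threshold flag)."""
    return (new_metric - prod_metric) >= threshold


def gate_report(new_metric: float, prod_metric: float, threshold: float = 0.02) -> str:
    delta = new_metric - prod_metric
    verdict = "PASS" if passes_gate(new_metric, prod_metric, threshold) else "FAIL"
    return f"{verdict}: new={new_metric:.3f} prod={prod_metric:.3f} delta={delta:+.3f}"


# Stubbed metrics; in CI a FAIL would map to a non-zero exit code so the
# quality-gate job blocks model registration.
print(gate_report(0.86, 0.82))  # → PASS: new=0.860 prod=0.820 delta=+0.040
print(gate_report(0.83, 0.82))  # → FAIL: new=0.830 prod=0.820 delta=+0.010
```

The key design point is that the gate compares against the *current production model*, not a fixed constant, so the bar rises automatically as better models ship.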
Multi-level Test Suite
"""
ML Model Test Suite, ruleaza in CI/CD dupa fiecare antrenare.
"""
import pytest
import numpy as np
import joblib
from pathlib import Path
MODEL_PATH = Path("artifacts/model/model.joblib")
TEST_DATA_PATH = Path("data/test/test.parquet")
@pytest.fixture(scope="session")
def model():
return joblib.load(MODEL_PATH)
@pytest.fixture(scope="session")
def test_data():
import pandas as pd
return pd.read_parquet(TEST_DATA_PATH)
class TestModelAccuracy:
"""Quality gates: modelul trebuie sa atinga praguri minime de performanta."""
def test_accuracy_above_threshold(self, model, test_data):
from sklearn.metrics import accuracy_score
X = test_data.drop("target", axis=1)
y = test_data["target"]
y_pred = model.predict(X)
accuracy = accuracy_score(y, y_pred)
assert accuracy >= 0.85, f"Accuracy {accuracy:.3f} below 0.85 threshold"
def test_no_regression_vs_baseline(self, model, test_data):
"""Modelul nou nu trebuie sa fie mai slab decat baseline-ul documentat."""
from sklearn.metrics import f1_score
X = test_data.drop("target", axis=1)
y = test_data["target"]
y_pred = model.predict(X)
f1 = f1_score(y, y_pred, average="weighted")
BASELINE_F1 = 0.82 # Documented baseline from last stable release
assert f1 >= BASELINE_F1 - 0.02, f"F1 {f1:.3f} regressed vs baseline {BASELINE_F1}"
class TestModelRobustness:
"""Verifica daca modelul gestioneaza corect cazurile limita."""
def test_handles_missing_values(self, model):
"""Modelul nu ar trebui sa se blocheze pe input-uri NaN."""
sample = np.full((1, model.n_features_in_), np.nan)
try:
model.predict(sample)
except ValueError:
pass # Expected for models that don't handle NaN
# Should not raise unexpected exceptions
def test_prediction_determinism(self, model, test_data):
"""Acelasi input trebuie sa produca acelasi output."""
X = test_data.drop("target", axis=1).head(10)
pred1 = model.predict(X)
pred2 = model.predict(X)
np.testing.assert_array_equal(pred1, pred2)
def test_prediction_latency(self, model, test_data):
"""O singura predictie trebuie sa fie suficient de rapida pentru SLA-ul de serving."""
import time
X_single = test_data.drop("target", axis=1).head(1)
times = []
for _ in range(100):
start = time.perf_counter()
model.predict(X_single)
times.append((time.perf_counter() - start) * 1000)
p99 = np.percentile(times, 99)
assert p99 < 50, f"P99 latency {p99:.1f}ms exceeds 50ms SLA"
class TestModelFairness:
"""Verifica bias-ul intre grupuri protejate."""
def test_equal_opportunity(self, model, test_data):
"""Rata de true positive trebuie sa fie similara intre grupuri."""
from sklearn.metrics import recall_score
X = test_data.drop("target", axis=1)
y = test_data["target"]
y_pred = model.predict(X)
if "demographic_group" not in test_data.columns:
pytest.skip("No demographic column available")
groups = test_data["demographic_group"].unique()
tpr_by_group = {}
for group in groups:
mask = test_data["demographic_group"] == group
if mask.sum() < 50:
continue
tpr = recall_score(y[mask], y_pred[mask], zero_division=0)
tpr_by_group[group] = tpr
if len(tpr_by_group) < 2:
pytest.skip("Not enough groups for comparison")
max_tpr = max(tpr_by_group.values())
min_tpr = min(tpr_by_group.values())
disparity = max_tpr - min_tpr
assert disparity < 0.15, f"TPR disparity {disparity:.3f} exceeds 0.15 threshold: {tpr_by_group}"Data Version Control in CI/CD
# dvc.yaml: defines reproducible pipeline stages
stages:
  preprocess:
    cmd: python src/preprocess.py --config configs/preprocess.yaml
    deps:
      - src/preprocess.py
      - data/raw/
      - configs/preprocess.yaml
    outs:
      - data/processed/
  train:
    cmd: python src/train.py --config configs/training/production.yaml
    deps:
      - src/train.py
      - data/processed/
      - configs/training/production.yaml
    outs:
      - artifacts/model/
    metrics:
      - metrics/training.json:
          cache: false
    params:
      - configs/training/production.yaml:
          - model.n_estimators
          - model.max_depth
          - model.learning_rate

# In CI: reproduce the pipeline and check what changed
dvc repro
dvc metrics diff  # Compare metrics with the previous run
dvc plots diff    # Generate a visual comparison

Deployment Strategies for ML Models
Blue-Green Deployment
def blue_green_deploy(model_name: str, new_version: str):
    """Atomically switch traffic between model versions."""
    # Deploy the new version to the "green" endpoint
    deploy_to_endpoint(model_name, new_version, endpoint="green")
    # Run smoke tests against green
    if smoke_test(endpoint="green"):
        # Switch traffic from blue to green
        switch_traffic(model_name, from_endpoint="blue", to_endpoint="green")
        # Keep blue around as a rollback target
    else:
        # Tear down the failed green deployment
        teardown_endpoint("green")
        raise DeploymentError(f"Smoke tests failed for {model_name}:{new_version}")

Progressive Rollout
ROLLOUT_STAGES = [
    {"traffic_pct": 5, "duration_minutes": 15, "error_threshold": 0.02},
    {"traffic_pct": 25, "duration_minutes": 30, "error_threshold": 0.015},
    {"traffic_pct": 50, "duration_minutes": 60, "error_threshold": 0.01},
    {"traffic_pct": 100, "duration_minutes": 0, "error_threshold": 0.01},
]

async def progressive_rollout(model_name: str, new_version: str):
    for stage in ROLLOUT_STAGES:
        set_traffic_split(model_name, new_version, stage["traffic_pct"])
        if stage["duration_minutes"] > 0:
            metrics = await monitor_for(minutes=stage["duration_minutes"])
            if metrics["error_rate"] > stage["error_threshold"]:
                rollback(model_name)
                raise RolloutError(
                    f"Error rate {metrics['error_rate']:.3f} exceeded threshold "
                    f"at {stage['traffic_pct']}%"
                )
    mark_as_production(model_name, new_version)

Conclusions
- Test data as rigorously as code: schema validation, statistical checks, and freshness monitoring
- Quality gates before deployment: models must beat baselines, not just pass unit tests
- Progressive deployment: never jump from 0% to 100% traffic instantly
- Version everything: code, data, model, and configuration must all be reproducible
- Automate retraining: scheduled or trigger-based, with human approval for production promotion
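The last point, trigger-based retraining with human approval, can be sketched as a simple decision function. The drift score source and the numeric thresholds here are assumptions; in practice the inputs would come from a monitoring system, and promotion would still pass through a protected CI environment gate:

```python
"""Sketch of a trigger-based retraining decision. The thresholds and the
drift metric are assumed placeholders, not values from this article."""

DRIFT_THRESHOLD = 0.25  # e.g. PSI on key features; assumed value
METRIC_FLOOR = 0.80     # minimum acceptable live metric; assumed value


def should_retrain(drift_score: float, live_metric: float) -> tuple[bool, str]:
    """Return (retrain?, reason). Either signal alone is enough to trigger."""
    if drift_score > DRIFT_THRESHOLD:
        return True, f"feature drift {drift_score:.2f} > {DRIFT_THRESHOLD}"
    if live_metric < METRIC_FLOOR:
        return True, f"live metric {live_metric:.2f} < {METRIC_FLOOR}"
    return False, "within bounds"


# Retraining kicks off automatically, but promotion still goes through
# the human-approved deploy job from the workflow above.
retrain, reason = should_retrain(drift_score=0.31, live_metric=0.84)
print(retrain, reason)  # drift alone triggers retraining here
```

A scheduled job can call this check daily and dispatch the training workflow when it returns true, keeping humans in the loop only for the final promotion step.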
Related Resources
- MLOps best practices for the complete production guide
- MLflow tutorial for experiment tracking in your CI/CD
- Model monitoring to close the loop after deployment
- CI/CD security with GitHub Actions for hardening ML pipelines
Building ML CI/CD? DeviDevs implements end-to-end MLOps pipelines with automated testing, progressive deployment, and monitoring. Get a free assessment ->
Is your AI system compliant with the EU AI Act? Free risk assessment - find out in 2 minutes →