MLOps

ML CI/CD: Continuous Integration and Deployment for Machine Learning

DeviDevs Team
7 min read
#ML CI/CD#MLOps#model deployment#continuous integration#GitHub Actions#model testing

CI/CD for machine learning is fundamentally different from traditional software CI/CD. In software, you test code. In ML, you test the code, the data, the model's quality, and the serving infrastructure. This guide covers how to build ML CI/CD pipelines that are reliable enough for production.

Why ML CI/CD Is Different

| Aspect | Software CI/CD | ML CI/CD |
|--------|---------------|----------|
| What changes | Code | Code + Data + Model + Config |
| Test types | Unit, integration, e2e | Data quality, model quality, integration, performance |
| Build artifact | Container/binary | Model artifact + serving config |
| Deployment trigger | Code push | Code push OR data refresh OR performance degradation |
| Rollback | Previous code version | Previous model version (may need different features) |
| Environment | Standard compute | GPU clusters for training, CPU/GPU for serving |

ML CI/CD Pipeline Architecture

Code Push / Data Refresh / Schedule
             │
             ▼
    ┌─────────────────┐
    │ Data Validation │◄─── Schema checks, statistical tests, freshness
    └────────┬────────┘
             │ Pass
             ▼
    ┌─────────────────┐
    │ Feature Compute │◄─── Feature engineering, transformation
    └────────┬────────┘
             │
             ▼
    ┌─────────────────┐
    │    Training     │◄─── Hyperparameter config, compute allocation
    └────────┬────────┘
             │
             ▼
    ┌─────────────────┐
    │  Model Testing  │◄─── Quality gates, regression checks, bias tests
    └────────┬────────┘
             │ Pass
             ▼
    ┌─────────────────┐
    │    Registry     │◄─── Version, tag, store in model registry
    └────────┬────────┘
             │
             ▼
    ┌─────────────────┐
    │     Deploy      │◄─── Shadow → Canary → Production
    └────────┬────────┘
             │
             ▼
    ┌─────────────────┐
    │     Monitor     │◄─── Drift, performance, latency
    └─────────────────┘

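Whatever runs this flow (GitHub Actions below, or an orchestrator such as Airflow or Kubeflow in larger setups), the core idea is a chain of gated stages: each stage either passes its artifact forward or halts the run. Here is a minimal, tool-agnostic sketch, with the stage callables left as placeholders:

"""Tool-agnostic sketch of the gated flow above: each stage returns (passed, artifact)."""
from typing import Any, Callable, Iterable, Tuple

# A stage takes the previous stage's artifact and returns (passed, new_artifact).
Stage = Callable[[Any], Tuple[bool, Any]]

def run_pipeline(stages: Iterable[Tuple[str, Stage]], payload: Any = None) -> Any:
    for name, stage in stages:
        passed, payload = stage(payload)
        if not passed:
            raise RuntimeError(f"Stage '{name}' failed; halting pipeline")
        print(f"Stage '{name}' passed")
    return payload

In GitHub Actions the same gating comes from job and step ordering: a step that exits non-zero fails its job, and downstream jobs declared with needs: never run.
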
GitHub Actions for ML CI/CD

Data Validation Pipeline

name: Data Validation
on:
  schedule:
    - cron: '0 1 * * *'  # Daily at 1 AM UTC
  workflow_dispatch:
    inputs:
      data_version:
        description: 'Data version to validate'
        required: false
 
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
 
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
 
      - name: Install dependencies
        run: pip install -r requirements/validation.txt
 
      - name: Pull latest data
        run: dvc pull data/processed/
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
 
      - name: Run schema validation
        run: python pipelines/validate_schema.py
 
      - name: Run statistical tests
        run: python pipelines/validate_statistics.py
 
      - name: Run data quality checks
        run: python pipelines/validate_quality.py
 
      - name: Generate data profile
        run: python pipelines/generate_profile.py --output reports/data_profile.html
 
      - name: Upload validation report
        uses: actions/upload-artifact@v4
        with:
          name: data-validation-report
          path: reports/
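
The workflow above delegates the actual checks to scripts under pipelines/. What they contain depends on your data, but a minimal sketch of pipelines/validate_schema.py is shown below; the parquet path and the expected columns are illustrative placeholders, not part of the workflow above.

"""Minimal schema check for pipelines/validate_schema.py (sketch; path and schema are examples)."""
import sys
import pandas as pd

DATA_PATH = "data/processed/train.parquet"   # assumed output of `dvc pull data/processed/`
EXPECTED_SCHEMA = {                          # column -> expected pandas dtype (example values)
    "customer_id": "int64",
    "tenure_months": "int64",
    "monthly_charges": "float64",
    "target": "int64",
}
MAX_NULL_FRACTION = 0.01                     # fail if more than 1% of any column is null

def main() -> int:
    df = pd.read_parquet(DATA_PATH)
    errors = []

    # Required columns and dtypes
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            errors.append(f"{col}: dtype {df[col].dtype}, expected {dtype}")

    # Null-rate check on every column
    for col in df.columns:
        null_frac = df[col].isna().mean()
        if null_frac > MAX_NULL_FRACTION:
            errors.append(f"{col}: {null_frac:.1%} nulls exceeds {MAX_NULL_FRACTION:.0%}")

    if errors:
        print("Schema validation failed:\n- " + "\n- ".join(errors))
        return 1
    print("Schema validation passed")
    return 0

if __name__ == "__main__":
    sys.exit(main())

The non-zero exit code is what makes the GitHub Actions step fail, which in turn blocks every downstream step in the job.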

Training and Model Testing Pipeline

name: ML Training Pipeline
on:
  push:
    paths:
      - 'src/models/**'
      - 'src/features/**'
      - 'configs/training/**'
  workflow_dispatch:
    inputs:
      experiment_name:
        description: 'MLflow experiment name'
        required: true
        default: 'production-training'
 
jobs:
  train:
    runs-on: [self-hosted, gpu]
    timeout-minutes: 120
    steps:
      - uses: actions/checkout@v4
 
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
 
      - name: Install dependencies
        run: pip install -r requirements/training.txt
 
      - name: Pull training data
        run: dvc pull data/
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
 
      - name: Train model
        run: |
          python src/train.py \
            --config configs/training/production.yaml \
            --experiment ${{ inputs.experiment_name || 'production-training' }}
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
 
      - name: Run model tests
        run: pytest tests/model/ -v --tb=short
 
      - name: Run bias and fairness tests
        run: python tests/fairness/check_bias.py
 
      - name: Upload model artifact
        uses: actions/upload-artifact@v4
        with:
          name: trained-model
          path: artifacts/model/
 
  quality-gate:
    needs: train
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        # Reuse the training requirements so the comparison and registration scripts have mlflow and scikit-learn
        run: pip install -r requirements/training.txt

      - name: Download model artifact
        uses: actions/download-artifact@v4
        with:
          name: trained-model
          path: artifacts/model/
 
      - name: Compare with production model
        run: |
          python pipelines/compare_models.py \
            --new-model artifacts/model/ \
            --production-model models:/churn-predictor/Production \
            --threshold 0.02
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
 
      - name: Register model if improved
        if: success()
        run: |
          python pipelines/register_model.py \
            --model-path artifacts/model/ \
            --name churn-predictor \
            --stage staging
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
 
  deploy:
    needs: quality-gate
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
 
      - name: Deploy canary (10% traffic)
        run: |
          python pipelines/deploy.py \
            --model-name churn-predictor \
            --stage staging \
            --strategy canary \
            --traffic-split 10
        env:
          K8S_CLUSTER: ${{ secrets.K8S_CLUSTER }}
 
      - name: Wait and monitor canary
        run: python pipelines/monitor_canary.py --duration 30m --model churn-predictor
 
      - name: Promote to production
        run: |
          python pipelines/deploy.py \
            --model-name churn-predictor \
            --strategy promote
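
The quality gate hinges on pipelines/compare_models.py. A minimal sketch is below; it assumes the new model is the joblib artifact produced by training, the production model is resolved through the MLflow registry URI used above, and the held-out test split is available to the job (pulled with DVC or passed along as an artifact). The metric and file paths are illustrative.

"""Sketch of pipelines/compare_models.py: fail CI unless the new model beats production."""
import argparse
import sys

import joblib
import mlflow.pyfunc
import pandas as pd
from sklearn.metrics import f1_score

def evaluate(model, X, y) -> float:
    # Works for both the sklearn estimator and the MLflow pyfunc wrapper.
    return f1_score(y, model.predict(X), average="weighted")

def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--new-model", required=True)          # e.g. artifacts/model/
    parser.add_argument("--production-model", required=True)   # e.g. models:/churn-predictor/Production
    parser.add_argument("--threshold", type=float, default=0.0)
    args = parser.parse_args()

    test = pd.read_parquet("data/test/test.parquet")           # same split the test suite uses
    X, y = test.drop("target", axis=1), test["target"]

    new_f1 = evaluate(joblib.load(f"{args.new_model}/model.joblib"), X, y)
    prod_f1 = evaluate(mlflow.pyfunc.load_model(args.production_model), X, y)
    print(f"new F1={new_f1:.4f}  production F1={prod_f1:.4f}")

    if new_f1 < prod_f1 + args.threshold:
        print(f"New model does not beat production by at least {args.threshold}; failing gate")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())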

Model Testing Framework

Multi-level Test Suite

"""
ML Model Test Suite — runs in CI/CD after every training run.
"""
import pytest
import numpy as np
import joblib
from pathlib import Path
 
MODEL_PATH = Path("artifacts/model/model.joblib")
TEST_DATA_PATH = Path("data/test/test.parquet")
 
@pytest.fixture(scope="session")
def model():
    return joblib.load(MODEL_PATH)
 
@pytest.fixture(scope="session")
def test_data():
    import pandas as pd
    return pd.read_parquet(TEST_DATA_PATH)
 
 
class TestModelAccuracy:
    """Quality gates: model must meet minimum performance thresholds."""
 
    def test_accuracy_above_threshold(self, model, test_data):
        from sklearn.metrics import accuracy_score
        X = test_data.drop("target", axis=1)
        y = test_data["target"]
        y_pred = model.predict(X)
        accuracy = accuracy_score(y, y_pred)
        assert accuracy >= 0.85, f"Accuracy {accuracy:.3f} below 0.85 threshold"
 
    def test_no_regression_vs_baseline(self, model, test_data):
        """New model must not be worse than documented baseline."""
        from sklearn.metrics import f1_score
        X = test_data.drop("target", axis=1)
        y = test_data["target"]
        y_pred = model.predict(X)
        f1 = f1_score(y, y_pred, average="weighted")
        BASELINE_F1 = 0.82  # Documented baseline from last stable release
        assert f1 >= BASELINE_F1 - 0.02, f"F1 {f1:.3f} regressed vs baseline {BASELINE_F1}"
 
 
class TestModelRobustness:
    """Ensure model handles edge cases gracefully."""
 
    def test_handles_missing_values(self, model):
        """Model should not crash on NaN inputs."""
        sample = np.full((1, model.n_features_in_), np.nan)
        try:
            model.predict(sample)
        except ValueError:
            pass  # Expected for models that don't handle NaN
        # Should not raise unexpected exceptions
 
    def test_prediction_determinism(self, model, test_data):
        """Same input should produce same output."""
        X = test_data.drop("target", axis=1).head(10)
        pred1 = model.predict(X)
        pred2 = model.predict(X)
        np.testing.assert_array_equal(pred1, pred2)
 
    def test_prediction_latency(self, model, test_data):
        """Single prediction must be fast enough for serving SLA."""
        import time
        X_single = test_data.drop("target", axis=1).head(1)
        times = []
        for _ in range(100):
            start = time.perf_counter()
            model.predict(X_single)
            times.append((time.perf_counter() - start) * 1000)
        p99 = np.percentile(times, 99)
        assert p99 < 50, f"P99 latency {p99:.1f}ms exceeds 50ms SLA"
 
 
class TestModelFairness:
    """Check for bias across protected groups."""
 
    def test_equal_opportunity(self, model, test_data):
        """True positive rate should be similar across groups."""
        from sklearn.metrics import recall_score

        if "demographic_group" not in test_data.columns:
            pytest.skip("No demographic column available")

        X = test_data.drop("target", axis=1)
        y = test_data["target"]
        y_pred = model.predict(X)
 
        groups = test_data["demographic_group"].unique()
        tpr_by_group = {}
        for group in groups:
            mask = test_data["demographic_group"] == group
            if mask.sum() < 50:
                continue
            tpr = recall_score(y[mask], y_pred[mask], zero_division=0)
            tpr_by_group[group] = tpr
 
        if len(tpr_by_group) < 2:
            pytest.skip("Not enough groups for comparison")
 
        max_tpr = max(tpr_by_group.values())
        min_tpr = min(tpr_by_group.values())
        disparity = max_tpr - min_tpr
        assert disparity < 0.15, f"TPR disparity {disparity:.3f} exceeds 0.15 threshold: {tpr_by_group}"

Data Version Control in CI/CD

# dvc.yaml — define reproducible pipeline stages
stages:
  preprocess:
    cmd: python src/preprocess.py --config configs/preprocess.yaml
    deps:
      - src/preprocess.py
      - data/raw/
      - configs/preprocess.yaml
    outs:
      - data/processed/
 
  train:
    cmd: python src/train.py --config configs/training/production.yaml
    deps:
      - src/train.py
      - data/processed/
      - configs/training/production.yaml
    outs:
      - artifacts/model/
    metrics:
      - metrics/training.json:
          cache: false
    params:
      - configs/training/production.yaml:
          - model.n_estimators
          - model.max_depth
          - model.learning_rate

# In CI: reproduce the pipeline and check for changes
dvc repro
dvc metrics diff  # Compare metrics vs. previous run
dvc plots diff    # Generate visual comparison
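
The CI steps shell out to DVC, but pipeline code can also read a specific, pinned data version directly through the dvc.api Python module. A small sketch, where the repository URL, revision tag, and file path are placeholders:

"""Read DVC-tracked data pinned to a git revision (sketch; repo, rev, and path are placeholders)."""
import io
import dvc.api
import pandas as pd

raw = dvc.api.read(
    "data/processed/train.parquet",
    repo="https://github.com/your-org/churn-model",  # placeholder repository
    rev="v1.4.0",                                     # git tag or commit pinning the data version
    mode="rb",
)
df = pd.read_parquet(io.BytesIO(raw))
print(df.shape)

Because the revision pins both the code and the .dvc pointers, the same tag always resolves to the same bytes, which is what makes training runs reproducible.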

Deployment Strategies for ML Models

Blue-Green Deployment

def blue_green_deploy(model_name: str, new_version: str):
    """Switch traffic atomically between model versions."""
    # Deploy new version to "green" endpoint
    deploy_to_endpoint(model_name, new_version, endpoint="green")
 
    # Run smoke tests against green
    if smoke_test(endpoint="green"):
        # Switch traffic from blue to green
        switch_traffic(model_name, from_endpoint="blue", to_endpoint="green")
        # Keep blue as rollback
    else:
        # Tear down failed green deployment
        teardown_endpoint("green")
        raise DeploymentError(f"Smoke tests failed for {model_name}:{new_version}")
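
The helpers here (deploy_to_endpoint, switch_traffic, teardown_endpoint) wrap whatever your serving platform exposes; the real gatekeeper is smoke_test. A minimal version, assuming the green deployment serves an HTTP /predict endpoint and a set of known-good payloads is checked into the repo, might look like this:

"""Sketch of smoke_test: hit the freshly deployed endpoint with known-good payloads.
The service URL pattern and payload file are assumptions about your serving stack."""
import json
import requests

ENDPOINT_URL = "http://churn-predictor-{name}.models.svc:8080/predict"  # placeholder service URL
SMOKE_PAYLOADS = "tests/smoke/payloads.json"                            # inputs with expected predictions

def smoke_test(endpoint: str = "green", timeout_s: float = 2.0) -> bool:
    url = ENDPOINT_URL.format(name=endpoint)
    with open(SMOKE_PAYLOADS) as f:
        cases = json.load(f)

    for case in cases:
        try:
            resp = requests.post(url, json=case["input"], timeout=timeout_s)
        except requests.RequestException:
            return False
        if resp.status_code != 200 or resp.json().get("prediction") != case["expected"]:
            return False
    return True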

Progressive Rollout

ROLLOUT_STAGES = [
    {"traffic_pct": 5, "duration_minutes": 15, "error_threshold": 0.02},
    {"traffic_pct": 25, "duration_minutes": 30, "error_threshold": 0.015},
    {"traffic_pct": 50, "duration_minutes": 60, "error_threshold": 0.01},
    {"traffic_pct": 100, "duration_minutes": 0, "error_threshold": 0.01},
]
 
async def progressive_rollout(model_name: str, new_version: str):
    for stage in ROLLOUT_STAGES:
        set_traffic_split(model_name, new_version, stage["traffic_pct"])
 
        if stage["duration_minutes"] > 0:
            metrics = await monitor_for(minutes=stage["duration_minutes"])
            if metrics["error_rate"] > stage["error_threshold"]:
                rollback(model_name)
                raise RolloutError(f"Error rate {metrics['error_rate']:.3f} exceeded threshold at {stage['traffic_pct']}%")
 
    mark_as_production(model_name, new_version)
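
Between stages, monitor_for has to reduce live traffic to a single error rate. Below is a sketch that samples it from Prometheus over the standard HTTP query API; the Prometheus address and the PromQL expression are assumptions about how your serving metrics are labelled.

"""Sketch of monitor_for: sample the model's error rate from Prometheus for N minutes.
The Prometheus address and the metric names are placeholders."""
import asyncio
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"
ERROR_RATE_QUERY = (
    'sum(rate(model_requests_total{model="churn-predictor",status="error"}[5m]))'
    ' / sum(rate(model_requests_total{model="churn-predictor"}[5m]))'
)

def query_error_rate() -> float:
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": ERROR_RATE_QUERY},
        timeout=5,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

async def monitor_for(minutes: int, interval_s: int = 60) -> dict:
    # Track the worst error rate seen during the window; the caller compares it to the stage threshold.
    worst = 0.0
    for _ in range(max(1, minutes * 60 // interval_s)):
        await asyncio.sleep(interval_s)
        worst = max(worst, query_error_rate())
    return {"error_rate": worst}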

Key Takeaways

  1. Test data as rigorously as code — Schema validation, statistical checks, and freshness monitoring
  2. Quality gates before deployment — Models must beat baselines, not just pass unit tests
  3. Progressive deployment — Never go from 0% to 100% traffic instantly
  4. Version everything — Code, data, model, and config must be reproducible
  5. Automate retraining — Schedule or trigger-based, with human approval for production promotion

Building ML CI/CD? DeviDevs implements end-to-end MLOps pipelines with automated testing, progressive deployment, and monitoring. Get a free assessment →
