ML CI/CD: Continuous Integration and Deployment for Machine Learning
CI/CD for machine learning is fundamentally different from traditional software CI/CD. In software, you test code. In ML, you test code, data, model quality, and serving infrastructure. This guide covers how to build ML CI/CD pipelines that are reliable enough for production.
Why ML CI/CD Is Different
| Aspect | Software CI/CD | ML CI/CD |
|--------|---------------|----------|
| What changes | Code | Code + Data + Model + Config |
| Test types | Unit, integration, e2e | Data quality, model quality, integration, performance |
| Build artifact | Container/binary | Model artifact + serving config |
| Deployment trigger | Code push | Code push OR data refresh OR performance degradation |
| Rollback | Previous code version | Previous model version (may need different features) |
| Environment | Standard compute | GPU clusters for training, CPU/GPU for serving |
ML CI/CD Pipeline Architecture
Code Push / Data Refresh / Schedule
         │
         ▼
┌─────────────────┐
│ Data Validation │◄─── Schema checks, statistical tests, freshness
└────────┬────────┘
         │ Pass
         ▼
┌─────────────────┐
│ Feature Compute │◄─── Feature engineering, transformation
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Training        │◄─── Hyperparameter config, compute allocation
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Model Testing   │◄─── Quality gates, regression checks, bias tests
└────────┬────────┘
         │ Pass
         ▼
┌─────────────────┐
│ Registry        │◄─── Version, tag, store in model registry
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Deploy          │◄─── Shadow → Canary → Production
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Monitor         │◄─── Drift, performance, latency
└─────────────────┘
GitHub Actions for ML CI/CD
Data Validation Pipeline
name: Data Validation
on:
  schedule:
    - cron: '0 1 * * *'  # Daily at 1 AM UTC
  workflow_dispatch:
    inputs:
      data_version:
        description: 'Data version to validate'
        required: false
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
      - name: Install dependencies
        run: pip install -r requirements/validation.txt
      - name: Pull latest data
        run: dvc pull data/processed/
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - name: Run schema validation
        run: python pipelines/validate_schema.py
      - name: Run statistical tests
        run: python pipelines/validate_statistics.py
      - name: Run data quality checks
        run: python pipelines/validate_quality.py
      - name: Generate data profile
        run: python pipelines/generate_profile.py --output reports/data_profile.html
      - name: Upload validation report
        uses: actions/upload-artifact@v4
        with:
          name: data-validation-report
          path: reports/
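The workflow shells out to pipelines/validate_schema.py, pipelines/validate_statistics.py, and pipelines/validate_quality.py, which are project-specific. As a rough sketch, a schema check could look like the following — the column names, dtypes, and parquet path are assumptions made for illustration, not the repository's real contract.
# Hypothetical sketch of a schema gate like pipelines/validate_schema.py.
# A real script would load the expected schema from a versioned config file.
import sys
import pandas as pd

EXPECTED_DTYPES = {            # assumed columns, for illustration only
    "customer_id": "int64",
    "tenure_months": "int64",
    "monthly_charges": "float64",
    "churned": "int64",
}

def main(path: str = "data/processed/train.parquet") -> int:
    df = pd.read_parquet(path)
    errors = []
    for col, dtype in EXPECTED_DTYPES.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            errors.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    if errors:
        print("Schema validation failed:\n  " + "\n  ".join(errors))
        return 1
    print(f"Schema OK: {len(df)} rows, {len(df.columns)} columns")
    return 0

if __name__ == "__main__":
    sys.exit(main())
A non-zero exit code is all the workflow needs: the step fails and nothing downstream runs.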
Training and Model Testing Pipeline
name: ML Training Pipeline
on:
  push:
    paths:
      - 'src/models/**'
      - 'src/features/**'
      - 'configs/training/**'
  workflow_dispatch:
    inputs:
      experiment_name:
        description: 'MLflow experiment name'
        required: true
        default: 'production-training'
jobs:
  train:
    runs-on: [self-hosted, gpu]
    timeout-minutes: 120
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements/training.txt
      - name: Pull training data
        run: dvc pull data/
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - name: Train model
        run: |
          python src/train.py \
            --config configs/training/production.yaml \
            --experiment ${{ inputs.experiment_name || 'production-training' }}
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
      - name: Run model tests
        run: pytest tests/model/ -v --tb=short
      - name: Run bias and fairness tests
        run: python tests/fairness/check_bias.py
      - name: Upload model artifact
        uses: actions/upload-artifact@v4
        with:
          name: trained-model
          path: artifacts/model/
  quality-gate:
    needs: train
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download model artifact
        uses: actions/download-artifact@v4
        with:
          name: trained-model
          path: artifacts/model/
      - name: Compare with production model
        run: |
          python pipelines/compare_models.py \
            --new-model artifacts/model/ \
            --production-model models:/churn-predictor/Production \
            --threshold 0.02
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
      - name: Register model if improved
        if: success()
        run: |
          python pipelines/register_model.py \
            --model-path artifacts/model/ \
            --name churn-predictor \
            --stage staging
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
  deploy:
    needs: quality-gate
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Deploy canary (10% traffic)
        run: |
          python pipelines/deploy.py \
            --model-name churn-predictor \
            --stage staging \
            --strategy canary \
            --traffic-split 10
        env:
          K8S_CLUSTER: ${{ secrets.K8S_CLUSTER }}
      - name: Wait and monitor canary
        run: python pipelines/monitor_canary.py --duration 30m --model churn-predictor
      - name: Promote to production
        run: |
          python pipelines/deploy.py \
            --model-name churn-predictor \
            --strategy promote
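The quality gate relies on pipelines/compare_models.py, which this post does not show. A minimal sketch of that comparison might look like the code below, assuming the candidate is a joblib artifact, the production model resolves through an MLflow models:/ URI, the test set lives at an assumed path, and --threshold means the minimum improvement required over production (flip the comparison if your gate merely tolerates small regressions).
# Hedged sketch of a model-comparison gate (what pipelines/compare_models.py might do).
import argparse
import sys

import joblib
import mlflow.pyfunc
import pandas as pd
from sklearn.metrics import f1_score

def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--new-model", required=True)
    parser.add_argument("--production-model", required=True)
    parser.add_argument("--threshold", type=float, default=0.02)
    parser.add_argument("--test-data", default="data/test/test.parquet")  # assumed path
    args = parser.parse_args()

    test = pd.read_parquet(args.test_data)
    X, y = test.drop("target", axis=1), test["target"]

    new_model = joblib.load(f"{args.new_model}/model.joblib")
    prod_model = mlflow.pyfunc.load_model(args.production_model)  # uses MLFLOW_TRACKING_URI

    new_f1 = f1_score(y, new_model.predict(X), average="weighted")
    prod_f1 = f1_score(y, prod_model.predict(X), average="weighted")
    print(f"candidate F1={new_f1:.4f}  production F1={prod_f1:.4f}")

    if new_f1 < prod_f1 + args.threshold:
        print("Candidate does not beat production by the required margin")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())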
Model Testing Framework
Multi-level Test Suite
"""
ML Model Test Suite — runs in CI/CD after every training.
"""
import pytest
import numpy as np
import joblib
from pathlib import Path
MODEL_PATH = Path("artifacts/model/model.joblib")
TEST_DATA_PATH = Path("data/test/test.parquet")
@pytest.fixture(scope="session")
def model():
return joblib.load(MODEL_PATH)
@pytest.fixture(scope="session")
def test_data():
import pandas as pd
return pd.read_parquet(TEST_DATA_PATH)
class TestModelAccuracy:
"""Quality gates: model must meet minimum performance thresholds."""
def test_accuracy_above_threshold(self, model, test_data):
from sklearn.metrics import accuracy_score
X = test_data.drop("target", axis=1)
y = test_data["target"]
y_pred = model.predict(X)
accuracy = accuracy_score(y, y_pred)
assert accuracy >= 0.85, f"Accuracy {accuracy:.3f} below 0.85 threshold"
def test_no_regression_vs_baseline(self, model, test_data):
"""New model must not be worse than documented baseline."""
from sklearn.metrics import f1_score
X = test_data.drop("target", axis=1)
y = test_data["target"]
y_pred = model.predict(X)
f1 = f1_score(y, y_pred, average="weighted")
BASELINE_F1 = 0.82 # Documented baseline from last stable release
assert f1 >= BASELINE_F1 - 0.02, f"F1 {f1:.3f} regressed vs baseline {BASELINE_F1}"
class TestModelRobustness:
"""Ensure model handles edge cases gracefully."""
def test_handles_missing_values(self, model):
"""Model should not crash on NaN inputs."""
sample = np.full((1, model.n_features_in_), np.nan)
try:
model.predict(sample)
except ValueError:
pass # Expected for models that don't handle NaN
# Should not raise unexpected exceptions
def test_prediction_determinism(self, model, test_data):
"""Same input should produce same output."""
X = test_data.drop("target", axis=1).head(10)
pred1 = model.predict(X)
pred2 = model.predict(X)
np.testing.assert_array_equal(pred1, pred2)
def test_prediction_latency(self, model, test_data):
"""Single prediction must be fast enough for serving SLA."""
import time
X_single = test_data.drop("target", axis=1).head(1)
times = []
for _ in range(100):
start = time.perf_counter()
model.predict(X_single)
times.append((time.perf_counter() - start) * 1000)
p99 = np.percentile(times, 99)
assert p99 < 50, f"P99 latency {p99:.1f}ms exceeds 50ms SLA"
class TestModelFairness:
"""Check for bias across protected groups."""
def test_equal_opportunity(self, model, test_data):
"""True positive rate should be similar across groups."""
from sklearn.metrics import recall_score
X = test_data.drop("target", axis=1)
y = test_data["target"]
y_pred = model.predict(X)
if "demographic_group" not in test_data.columns:
pytest.skip("No demographic column available")
groups = test_data["demographic_group"].unique()
tpr_by_group = {}
for group in groups:
mask = test_data["demographic_group"] == group
if mask.sum() < 50:
continue
tpr = recall_score(y[mask], y_pred[mask], zero_division=0)
tpr_by_group[group] = tpr
if len(tpr_by_group) < 2:
pytest.skip("Not enough groups for comparison")
max_tpr = max(tpr_by_group.values())
min_tpr = min(tpr_by_group.values())
disparity = max_tpr - min_tpr
        assert disparity < 0.15, f"TPR disparity {disparity:.3f} exceeds 0.15 threshold: {tpr_by_group}"
Data Version Control in CI/CD
# dvc.yaml — define reproducible pipeline stages
stages:
  preprocess:
    cmd: python src/preprocess.py --config configs/preprocess.yaml
    deps:
      - src/preprocess.py
      - data/raw/
      - configs/preprocess.yaml
    outs:
      - data/processed/
  train:
    cmd: python src/train.py --config configs/training/production.yaml
    deps:
      - src/train.py
      - data/processed/
      - configs/training/production.yaml
    outs:
      - artifacts/model/
    metrics:
      - metrics/training.json:
          cache: false
    params:
      - configs/training/production.yaml:
          - model.n_estimators
          - model.max_depth
          - model.learning_rate
# In CI: reproduce the pipeline and check for changes
dvc repro
dvc metrics diff # Compare metrics vs. previous run
dvc plots diff    # Generate visual comparison
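dvc metrics diff reports changes but does not fail the build by itself. If you want a hard gate, a small script can compare the metrics file declared in dvc.yaml against a committed baseline. The baseline path, the metric names, and the higher-is-better assumption below are all illustrative.
# Hedged sketch of a metrics gate run after `dvc repro` in CI.
import json
import sys
from pathlib import Path

METRICS_PATH = Path("metrics/training.json")
BASELINE_PATH = Path("metrics/baseline.json")   # assumed to be committed alongside the code
MAX_REGRESSION = 0.02                           # tolerated drop per metric (higher is better)

def main() -> int:
    current = json.loads(METRICS_PATH.read_text())
    baseline = json.loads(BASELINE_PATH.read_text())
    failures = []
    for metric, base_value in baseline.items():
        cur_value = current.get(metric)
        if cur_value is None:
            failures.append(f"{metric}: missing from current metrics")
        elif cur_value < base_value - MAX_REGRESSION:
            failures.append(f"{metric}: {cur_value:.4f} vs baseline {base_value:.4f}")
    if failures:
        print("Metric regression detected:\n  " + "\n  ".join(failures))
        return 1
    print("All metrics within tolerance of baseline")
    return 0

if __name__ == "__main__":
    sys.exit(main())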
Deployment Strategies for ML Models
Blue-Green Deployment
# Sketch: deploy_to_endpoint, smoke_test, switch_traffic, teardown_endpoint and
# DeploymentError are placeholders for your serving platform's API.
def blue_green_deploy(model_name: str, new_version: str):
    """Switch traffic atomically between model versions."""
    # Deploy new version to "green" endpoint
    deploy_to_endpoint(model_name, new_version, endpoint="green")
    # Run smoke tests against green
    if smoke_test(endpoint="green"):
        # Switch traffic from blue to green
        switch_traffic(model_name, from_endpoint="blue", to_endpoint="green")
        # Keep blue as rollback
    else:
        # Tear down failed green deployment
        teardown_endpoint("green")
        raise DeploymentError(f"Smoke tests failed for {model_name}:{new_version}")
Progressive Rollout
# Sketch: set_traffic_split, monitor_for, rollback, mark_as_production and
# RolloutError are placeholders for your traffic-management and monitoring APIs.
ROLLOUT_STAGES = [
    {"traffic_pct": 5, "duration_minutes": 15, "error_threshold": 0.02},
    {"traffic_pct": 25, "duration_minutes": 30, "error_threshold": 0.015},
    {"traffic_pct": 50, "duration_minutes": 60, "error_threshold": 0.01},
    {"traffic_pct": 100, "duration_minutes": 0, "error_threshold": 0.01},
]

async def progressive_rollout(model_name: str, new_version: str):
    for stage in ROLLOUT_STAGES:
        set_traffic_split(model_name, new_version, stage["traffic_pct"])
        if stage["duration_minutes"] > 0:
            metrics = await monitor_for(minutes=stage["duration_minutes"])
            if metrics["error_rate"] > stage["error_threshold"]:
                rollback(model_name)
                raise RolloutError(f"Error rate {metrics['error_rate']:.3f} exceeded threshold at {stage['traffic_pct']}%")
    mark_as_production(model_name, new_version)
Key Takeaways
- Test data as rigorously as code — Schema validation, statistical checks, and freshness monitoring
- Quality gates before deployment — Models must beat baselines, not just pass unit tests
- Progressive deployment — Never go from 0% to 100% traffic instantly
- Version everything — Code, data, model, and config must be reproducible
- Automate retraining — Schedule or trigger-based, with human approval for production promotion (see the sketch below)
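As a sketch of the trigger-based half of that last point: a monitoring job can call GitHub's workflow-dispatch endpoint to start the training pipeline defined above when drift crosses a threshold, while a protected environment on the deploy job keeps a human approval in front of production promotion. The repository name, workflow filename, drift threshold, and the source of the drift score are all assumptions.
# Hedged sketch: kick off retraining when drift exceeds a threshold.
# Uses GitHub's "create a workflow dispatch event" REST endpoint; the token
# needs workflow permissions, and all names below are placeholders.
import os
import requests

REPO = "your-org/your-ml-repo"      # assumption
WORKFLOW_FILE = "training.yml"      # assumption: the ML Training Pipeline above
DRIFT_THRESHOLD = 0.25              # assumption: tune to your drift metric

def trigger_retraining(drift_score: float) -> None:
    """Dispatch the training workflow if the monitored drift score is too high."""
    if drift_score <= DRIFT_THRESHOLD:
        return
    resp = requests.post(
        f"https://api.github.com/repos/{REPO}/actions/workflows/{WORKFLOW_FILE}/dispatches",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"ref": "main", "inputs": {"experiment_name": "drift-triggered-retraining"}},
        timeout=30,
    )
    resp.raise_for_status()
    print(f"Retraining dispatched (drift={drift_score:.2f})")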
Related Resources
- MLOps best practices for the complete production playbook
- MLflow tutorial for experiment tracking in your CI/CD
- Model monitoring to close the loop after deployment
- CI/CD security with GitHub Actions for securing your ML pipelines
Building ML CI/CD? DeviDevs implements end-to-end MLOps pipelines with automated testing, progressive deployment, and monitoring. Get a free assessment →