Your ML Pipeline Has No Tests. Here's How to Fix That.
Your application has 90% test coverage. Your ML pipeline has 0%.
You wouldn't deploy a REST API without unit tests. You'd get laughed out of the PR review. But somehow, the ML pipeline that feeds predictions to that same API ships with zero validation, zero gates, and a prayer that the training data didn't change since last Tuesday.
According to Gartner, roughly 85% of ML projects fail. Not because the models are bad, but because everything around the model (the pipeline infrastructure, the data validation, the deployment gates) is held together with duct tape and Jupyter notebooks.
If you've got models in production, this is the part where you either fix it or join the 85%.
The Three Failures Nobody Tests For
Traditional software fails loud. A null pointer throws an exception. A broken API returns a 500. You get paged, you fix it, you move on.
ML pipelines fail quiet. The model keeps serving predictions. The HTTP status is 200. But the predictions are garbage because something upstream changed and nobody noticed.
Here are the three silent killers:
1. Data drift. Your model trained on Q4 2025 data. It's March 2026. Customer behavior shifted, a feature distribution changed, and your model's accuracy dropped 12% over two months. No alert fired because you never set one up. Standard application monitoring (CPU, memory, HTTP errors) shows green across the board. The model is perfectly healthy from an infrastructure perspective. It's just wrong.
2. Schema breaks. An upstream team renamed a column from user_age to customer_age. Your feature pipeline silently fills it with nulls. The model happily consumes nulls and outputs nonsense. This is exactly how a simple upstream API change cascades into a full production outage: no error anywhere, just quietly wrong predictions.
3. Training-serving skew. Your training pipeline computes features one way. Your serving pipeline computes them differently. The model works great in offline evaluation and falls apart in production. Nobody catches it because nobody tests for it. This one is especially common when the data science team uses pandas for training and engineering uses Spark or a feature store for serving. Same logic, two implementations, subtle numerical differences that compound into wrong predictions.
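Catching the first failure, drift, doesn't require heavy tooling. A minimal sketch (the feature values and the 0.1/0.2 thresholds are illustrative; the thresholds follow the commonly cited rule of thumb for the Population Stability Index):

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time feature distribution and a live one.
    Common rule of thumb: PSI > 0.2 signals significant drift."""
    # Bin edges come from the training distribution's deciles
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid log(0) on empty bins
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train = rng.normal(35, 8, 10_000)    # e.g. user_age at training time
stable = rng.normal(35, 8, 10_000)   # same distribution: no drift
shifted = rng.normal(42, 8, 10_000)  # behavior changed: mean moved by 7

print(population_stability_index(train, stable) < 0.1)   # True
print(population_stability_index(train, shifted) > 0.2)  # True
```

Run a check like this on a schedule against live feature values and alert when it trips; that is the alert the drift scenario above never had.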
Layer 1: Data Validation with Great Expectations
Before your model sees a single row, validate the data. Great Expectations is the standard tool here, and it plugs straight into pytest.
```python
import great_expectations as gx

context = gx.get_context()

# Define what "healthy data" looks like
suite = context.add_expectation_suite("training_data_quality")

# batch_request identifies the data to validate; how you build it depends
# on your datasource (filesystem, SQL, Spark, ...)
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="training_data_quality",
)

# These are your data unit tests.
# Column exists and has no nulls
validator.expect_column_to_exist("user_age")
validator.expect_column_values_to_not_be_null("user_age")

# Distribution hasn't shifted wildly
validator.expect_column_mean_to_be_between("user_age", min_value=25, max_value=45)
validator.expect_column_values_to_be_between("user_age", min_value=13, max_value=120)

results = validator.validate()
assert results.success, f"Data validation failed: {results.statistics}"
```

Run this in CI before training starts. If the data is wrong, the pipeline stops. No bad model gets created in the first place.
The key insight: treat data expectations like application assertions. You wouldn't skip input validation on an API endpoint. Don't skip it on your training data.
Layer 2: Model Validation Gates
A trained model should pass a test suite before it touches production. Not "looks good in a notebook." A real, automated gate.
```python
# tests/test_model_quality.py
import time

import mlflow
import pytest

# Assumed helpers: evaluate_model(model, dataset) returns a metrics dict,
# and test_dataset / sample_batch are loaded elsewhere in your test suite.

@pytest.fixture
def candidate_model():
    """Load the model that just finished training (substitute the CI run's ID)."""
    return mlflow.pyfunc.load_model("runs:/<run_id>/model")

@pytest.fixture
def baseline_metrics():
    """Production model's metrics, our minimum bar."""
    prod_model = mlflow.pyfunc.load_model("models:/fraud-detector/Production")
    return evaluate_model(prod_model, test_dataset)

def test_accuracy_above_threshold(candidate_model):
    metrics = evaluate_model(candidate_model, test_dataset)
    assert metrics["auc"] >= 0.90, (
        f"AUC {metrics['auc']:.3f} below threshold 0.90"
    )

def test_no_regression_vs_production(candidate_model, baseline_metrics):
    candidate_metrics = evaluate_model(candidate_model, test_dataset)
    assert candidate_metrics["auc"] >= baseline_metrics["auc"] - 0.02, (
        f"Candidate AUC {candidate_metrics['auc']:.3f} regressed vs "
        f"production {baseline_metrics['auc']:.3f}"
    )

def test_inference_latency(candidate_model):
    start = time.perf_counter()
    candidate_model.predict(sample_batch)
    elapsed = time.perf_counter() - start
    assert elapsed < 0.1, f"Inference took {elapsed:.3f}s, limit is 0.1s"
```

Wire this into your CI pipeline: GitHub Actions, GitLab CI, whatever you use. The model registry promotion from "Staging" to "Production" should be gated by this test suite, not by someone clicking a button in a UI.
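The tests above assume an evaluate_model helper. A minimal sketch of what it might look like for a binary classifier (the `(features, labels)` dataset shape and the single "auc" metric are assumptions; adapt to your model type):

```python
# Hypothetical helper assumed by the model-quality tests. Returns a metrics
# dict so new metrics can be added without changing test signatures.
from sklearn.metrics import roc_auc_score

def evaluate_model(model, dataset):
    """Score a model on a held-out dataset and return a metrics dict."""
    X, y_true = dataset             # dataset as a (features, labels) pair
    y_scores = model.predict(X)     # assumes predict() returns raw scores
    return {"auc": roc_auc_score(y_true, y_scores)}
```

Keeping evaluation in one shared function matters here: both the threshold test and the regression test must score candidate and production models the exact same way, or the comparison is meaningless.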
Layer 3: Training-Serving Parity Tests
This is the one nobody does, and it causes the nastiest bugs. Your training pipeline uses pandas. Your serving pipeline uses a Spark feature store. Same logic, different implementations, different results.
```python
# tests/test_feature_parity.py
def test_feature_computation_matches():
    """Same input should produce same features,
    regardless of which pipeline computes them."""
    raw_input = load_test_fixture("sample_user_event.json")

    # Compute features both ways
    training_features = training_pipeline.compute_features(raw_input)
    serving_features = serving_pipeline.compute_features(raw_input)

    for feature_name in training_features:
        assert abs(
            training_features[feature_name] - serving_features[feature_name]
        ) < 1e-6, (
            f"Feature {feature_name} differs: "
            f"training={training_features[feature_name]}, "
            f"serving={serving_features[feature_name]}"
        )
```

Run this nightly. When someone updates the training pipeline's feature logic and forgets to update the serving side, this test catches it before users do.
Putting It Together: The ML CI/CD Pipeline
Here's what a real ML CI/CD pipeline looks like:
```yaml
# .github/workflows/ml-pipeline.yml
name: ML Pipeline CI/CD

on:
  push:
    paths: ["ml/**", "features/**"]

jobs:
  data-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install great-expectations pytest
      - run: pytest tests/test_data_quality.py -v

  model-training:
    needs: data-validation
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - run: python ml/train.py --experiment ci-run-${{ github.sha }}

  model-validation:
    needs: model-training
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install mlflow pytest
      - run: pytest tests/test_model_quality.py -v
      - run: pytest tests/test_feature_parity.py -v

  promote-to-staging:
    needs: model-validation
    runs-on: ubuntu-latest
    steps:
      - run: pip install mlflow
      - run: |
          python -c "
          import mlflow
          client = mlflow.tracking.MlflowClient()
          client.transition_model_version_stage(
              name='fraud-detector',
              version='${{ env.MODEL_VERSION }}',
              stage='Staging',
          )"
```

Data validation runs first. If it fails, training never starts. Model validation runs after training. If the model regresses or is too slow, it never gets promoted. Each stage gates the next.
How DeviDevs Approaches This
We build ML platforms for companies that have models in production but no testing infrastructure around them. The pattern is always the same: a data science team built something impressive, threw it over the wall to engineering, and now nobody knows why predictions went sideways last month.
The fix isn't a new framework or a fancier model. It's treating your ML pipeline like the production system it is, with the same testing discipline you'd apply to any other critical path.
If you're running models in production without validation gates, you're not doing MLOps. You're doing hope-driven development.
What This Means For Your Pipeline
Three layers. Data validation before training. Model quality gates before promotion. Feature parity tests between training and serving. None of this is exotic. It's the same CI/CD discipline you already apply to application code, extended to the ML pipeline.
The 85% failure rate isn't about bad models. It's about bad engineering practices around good models. The fix is boring, unglamorous, and entirely within your control.
You already know how to write tests. You already know how to set up CI/CD gates. You already know that untested code is broken code. The only thing missing is applying that same discipline to the ML pipeline sitting next to your application code.
Start with data validation. Add model gates. Test feature parity. Your models deserve the same engineering rigor as your APIs.
About DeviDevs: We build ML platforms, secure AI systems, and help companies comply with the EU AI Act. devidevs.com