Your ML Model Has No Rollback Plan. Fix It Before 3 AM.

Petru Constantin
7 min read
#mlops #model-deployment #canary-deployment #production-ml #kubernetes

You Shipped a Model. Can You Un-Ship It?

Here is a scenario that happens more often than anyone admits: your ML team trains a new recommendation model. Metrics look great in the notebook. The A/B test on staging was promising. You push it to production on a Tuesday afternoon. By Wednesday at 3 AM, conversion has dropped 12% and your on-call engineer is staring at a Grafana dashboard wondering which model version is even running.

According to VentureBeat's widely cited analysis, roughly 87% of ML projects never make it to production. But the projects that DO make it face a different problem: they deploy with no plan for what happens when things go wrong. A March 2026 MarkTechPost article found that most production ML teams still treat model deployment like software deployment. Ship it, hope for the best. The difference is that a bad software deploy usually crashes visibly. A bad model deploy degrades silently.

ML model deployment is not the same as shipping a new API endpoint. Models fail quietly. Accuracy drifts. Feature distributions shift. And if your rollback strategy is "SSH into the server and swap the model file," you are already hours behind.

Three Deployment Strategies That Actually Work

There are four common approaches to safe ML model deployment: canary, shadow, blue-green, and A/B testing. The first three are about safety. A/B testing is about learning. If you are not doing at least one of the safety strategies, you are rolling dice with production.

1. Canary Deployment: The 5% Test

Canary deployment sends a small percentage of production traffic to your new model while the old model handles the rest. If the new model performs well, you gradually increase traffic. If something breaks, you route everything back to the old model.

On Kubernetes, Argo Rollouts makes this straightforward. Here is a minimal configuration for a model serving canary:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: recommendation-model
spec:
  replicas: 5
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 30m}
        - setWeight: 20
        - pause: {duration: 1h}
        - setWeight: 50
        - pause: {duration: 2h}
      analysis:
        templates:
          - templateName: model-accuracy-check
        startingStep: 1
        args:
          - name: model-version
            valueFrom:
              podTemplateHashValue: Latest
  selector:
    matchLabels:
      app: recommendation-model
  template:
    metadata:
      labels:
        app: recommendation-model
    spec:
      containers:
        - name: model-server
          image: registry.example.com/rec-model:v2.3
          ports:
            - containerPort: 8080
```

The key is the analysis section. Argo Rollouts can query Prometheus, Datadog, or any metrics provider to automatically decide whether to promote or roll back. If your model's prediction latency spikes or accuracy drops below a threshold, the rollout stops and reverts. No human required at 3 AM.

For ML-specific canary analysis, define an AnalysisTemplate that checks model-level metrics:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: model-accuracy-check
spec:
  args:
    - name: model-version
  metrics:
    - name: prediction-latency-p99
      successCondition: result[0] < 200
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(model_predict_duration_seconds_bucket{
                model="recommendation",
                version="{{args.model-version}}"
              }[5m])) by (le)
            ) * 1000
    - name: error-rate
      successCondition: result[0] < 0.01
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(model_predict_errors_total{
              model="recommendation",
              version="{{args.model-version}}"
            }[10m]))
            /
            sum(rate(model_predict_requests_total{
              model="recommendation",
              version="{{args.model-version}}"
            }[10m]))
```

Start at 5% traffic. Wait 30 minutes. Check metrics. If everything is clean, move to 20%. This is how you deploy a model without betting the business on a single push.
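The gate logic Argo Rollouts evaluates at each pause is simple enough to sketch in plain Python. The thresholds below mirror the AnalysisTemplate's success conditions; `should_promote` and `next_weight` are illustrative names, not Argo APIs, and in a real pipeline the metric values would come from Prometheus:

```python
# Minimal sketch of the canary gate applied at each pause step.
# Thresholds mirror the AnalysisTemplate's successConditions.
P99_LATENCY_BUDGET_MS = 200   # prediction-latency-p99 gate
ERROR_RATE_BUDGET = 0.01      # error-rate gate

def should_promote(p99_latency_ms: float, error_rate: float) -> bool:
    """True if the canary passes both gates and traffic can increase."""
    return p99_latency_ms < P99_LATENCY_BUDGET_MS and error_rate < ERROR_RATE_BUDGET

def next_weight(current_weight: int, passed: bool, steps=(5, 20, 50, 100)) -> int:
    """Advance through the setWeight steps, or abort to 0% on a failed gate."""
    if not passed:
        return 0  # abort: route everything back to the stable model
    later = [w for w in steps if w > current_weight]
    return later[0] if later else 100
```

The point is that the decision is mechanical: two threshold checks and a weight table, which is exactly why it can run unattended overnight.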

2. Shadow Deployment: Test Without Risk

Shadow deployment runs your new model in parallel with the production model. Both receive the same requests. Only the production model's predictions reach users. The shadow model's outputs are logged for comparison.

This is the safest strategy. Zero user impact. Amazon SageMaker has shadow testing built in, but you can implement this with any service mesh. Here is a basic Istio mirror configuration:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: recommendation-service
spec:
  hosts:
    - recommendation-service
  http:
    - route:
        - destination:
            host: recommendation-service
            subset: v1
          weight: 100
      mirror:
        host: recommendation-service
        subset: v2-shadow
      mirrorPercentage:
        value: 100.0
```

The catch: shadow deployments double your infrastructure cost during testing. You are running two models on production traffic. For most teams, running shadows for 48-72 hours gives enough signal. You compare prediction distributions, latency profiles, and error rates between v1 and v2. If v2 looks good, promote it with a canary rollout.
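Comparing prediction distributions between v1 and v2 is where shadow testing pays off. One common drift measure is the Population Stability Index (PSI); here is a minimal pure-Python sketch, where the conventional ~0.2 alert threshold is a rule of thumb and the function names are illustrative:

```python
import math
from collections import Counter

def psi(baseline, candidate, bins=10):
    """Population Stability Index between two prediction-score samples.

    Bins are derived from the baseline's range. By rule of thumb, a PSI
    above ~0.2 suggests the shadow model's prediction distribution has
    shifted meaningfully from the production model's.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid zero width on constant input

    def bucket_fractions(sample):
        counts = Counter(
            max(0, min(int((x - lo) / width), bins - 1)) for x in sample
        )
        # Floor at a tiny epsilon so empty bins don't blow up the log.
        return [max(counts.get(i, 0) / len(sample), 1e-6) for i in range(bins)]

    b, c = bucket_fractions(baseline), bucket_fractions(candidate)
    return sum((cb - bb) * math.log(cb / bb) for bb, cb in zip(b, c))
```

You would compute this over the logged v1 and v2 outputs from the 48-72 hour shadow window, alongside latency and error-rate comparisons.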

3. Blue-Green: The Instant Swap

Blue-green keeps two identical production environments. One is live (blue), one is standby (green). You deploy the new model to green, validate it, then switch traffic. Rollback is just switching back to blue.

This is the simplest rollback strategy: one DNS change or load balancer update and you are back on the previous version. The downside is cost. You maintain double the infrastructure permanently.

For ML specifically, blue-green works best when combined with KServe's InferenceService, which handles model versioning natively:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: recommendation-model
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://models/recommendation/v2.3"
    canaryTrafficPercent: 10
```

KServe manages the traffic split and tracks both model revisions. Setting canaryTrafficPercent to 0 routes all traffic back to the previous stable revision, while removing the field promotes the new model to 100%. If you need to roll back to an older artifact entirely, point storageUri at the previous model version.

The Rollback That Takes Seconds, Not Hours

The common thread across all three strategies: rollback is a configuration change, not a redeployment. If your model rollback requires rebuilding a Docker image, pulling from a registry, and redeploying the pod, you are doing it wrong. As the Duckweave analysis put it: "If rollback requires redeploying the service, you are already too slow. Model lives in a registry with pinned digests. Rollback becomes: flip policy, revert threshold, pin old model. Seconds, not hours."

Your model artifacts should be versioned in a model registry (MLflow, Vertex AI Model Registry, or even S3 with strict versioning). Your serving layer should be decoupled from your training pipeline. When something goes wrong, you change which artifact the server loads. That is it.
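The pointer-flip idea can be sketched with an in-memory stand-in for a registry. MLflow, Vertex AI, and S3 each have their own APIs for this; everything below, including the class and method names, is illustrative:

```python
# Illustrative stand-in for a model registry: an alias ("production") maps
# to a pinned, immutable artifact URI. The serving layer only ever resolves
# the alias, so rollback is a pointer flip, not a redeploy.
class ModelRegistry:
    def __init__(self):
        self._versions = {}   # version -> artifact URI
        self._alias = {}      # alias -> version
        self._history = []    # promotion order, consumed by rollback

    def register(self, version, artifact_uri):
        self._versions[version] = artifact_uri

    def promote(self, version, alias="production"):
        if alias in self._alias:
            self._history.append(self._alias[alias])
        self._alias[alias] = version

    def rollback(self, alias="production"):
        """Flip the alias back to the previously promoted version."""
        previous = self._history.pop()
        self._alias[alias] = previous
        return previous

    def resolve(self, alias="production"):
        """What the serving layer calls on startup or config reload."""
        return self._versions[self._alias[alias]]
```

Real registries add auth, immutability guarantees, and pinned digests, but the shape is the same: promotion and rollback both reduce to rewriting one alias.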

How DeviDevs Approaches This

We have seen teams lose days of revenue because their "rollback plan" was a Slack thread that started with "does anyone remember which model was running before?" We help companies set up progressive delivery pipelines for ML: canary analysis with automated promotion gates, shadow testing for high-risk model updates, and model registries that make rollback a one-line operation.

If your team is deploying models to production without automated rollback, you are one bad prediction away from a very long night. We have built these pipelines for recommendation systems, fraud detection models, and NLP services. The infrastructure patterns are the same. The peace of mind is worth the setup cost.

If you are building ML infrastructure and want to get deployment right, check out our MLOps resources and compliance checklists for practical templates you can use today.

What This Means For Your Team

If you take one thing from this post: separate your model artifacts from your serving infrastructure. Version everything. Automate the rollback trigger. The goal is not zero incidents. The goal is that when an incident happens, recovery takes seconds instead of hours.

Start with canary. It is the lowest-effort, highest-impact change you can make to your ML deployment pipeline. Add shadow testing for high-stakes models. Graduate to full blue-green if your budget allows.

Your models will break in production. The question is whether you find out from a monitoring alert at 10 AM or from an angry VP at 3 AM.


About DeviDevs: We build ML platforms, secure AI systems, and help companies comply with the EU AI Act. devidevs.com
