Your ML Model Will Die in Production. Here Is How to Stop It.
You shipped the model. Now what?
Your model hit 94% accuracy in the notebook. The demo went great. Stakeholders clapped. You deployed it on a Friday afternoon because you were feeling brave.
Three weeks later, conversion dropped 12% and nobody connected it to the model. The fraud team noticed a spike in false negatives a month after that. By then, the model had been silently wrong for six weeks.
This is not a hypothetical. According to a Gartner study, roughly half of AI models never make it to production. But the ones that do face a worse fate: they degrade quietly, and teams only find out when the business damage is already done. DataRobot research found that 67% of enterprises reported critical model issues going unnoticed for over a month.
Zillow learned this the hard way. Their Zestimate pricing algorithm was trained on a relatively stable housing market. When pandemic-era volatility hit, the model kept overvaluing properties. Zillow bought thousands of overpriced homes before anyone pulled the plug. The result: $421 million in losses in a single quarter, 25% of the workforce laid off, and an $8 billion drop in market cap.
The model worked. Until it did not.
Why models break: drift is guaranteed
If your product changes, your users change, your competitors change, seasonality exists, or your data pipelines evolve, drift is not a risk. It is a certainty.
There are two types that kill models:
Data drift happens when the distribution of input features changes. A loan underwriting model trained on pre-pandemic income patterns encounters completely different employment distributions in 2025. The features (income levels, job categories, credit utilization) shift statistically from what the model learned. Predictions become unreliable, but no error is thrown.
Concept drift happens when the relationship between inputs and outputs changes. A content recommendation model that learned "users who search for 'remote work' want job listings" suddenly finds that in 2026, the same query means "remote work productivity tools." The inputs look the same. The correct output changed. The model has no way to know.
Both types share one property: the model does not crash. It does not throw exceptions. It just gets worse, slowly, in ways that are invisible unless you are watching specific metrics.
What monitoring actually looks like
Most teams think "monitoring" means checking if the API returns 200. That catches infrastructure failures. It catches zero ML failures.
Real ML monitoring has three layers:
Layer 1: Data quality and distribution tracking. Before the model even runs, check that inputs match what the model expects. Track feature distributions with statistical tests (Population Stability Index, Kolmogorov-Smirnov). If your "age" feature suddenly has 40% null values because an upstream pipeline broke, you want to know before the model produces garbage predictions.
Tools that handle this: Evidently AI (good for tabular and text data), whylogs (lightweight data logging), and Deepchecks (validation suites). All three are open source.
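To make the PSI check concrete, here is a minimal numpy sketch, not taken from any of the tools above. The 10-bin quantile scheme and the conventional 0.1 / 0.25 warning thresholds mentioned below are common rules of thumb, not standards.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample and a current sample of one feature."""
    # Bin edges come from quantiles of the reference distribution,
    # so each bin holds roughly equal reference mass
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Clip current values into the reference range so nothing falls outside
    actual = np.clip(actual, edges[0], edges[-1])
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the proportions to avoid log(0) on empty bins
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

A common reading: PSI below 0.1 is stable, 0.1 to 0.25 is worth watching, above 0.25 is significant drift and should page someone.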
Layer 2: Prediction monitoring. Track the distribution of model outputs over time. If your binary classifier suddenly predicts "positive" for 80% of inputs when the historical rate was 30%, something is wrong. You do not need ground truth labels for this. Just watch the predictions.
A simple approach:
```python
from scipy import stats

def check_prediction_drift(reference_preds, current_preds, threshold=0.05):
    """Two-sample KS test on prediction distributions.

    Returns (drifted, statistic, p_value); drifted is True when the
    test rejects at the given significance threshold.
    """
    statistic, p_value = stats.ks_2samp(reference_preds, current_preds)
    return p_value < threshold, statistic, p_value

# Run daily against your reference window. load_predictions and alert
# are placeholders for your own data access and alerting code.
ref = load_predictions(window="2026-02-01", days=30)
current = load_predictions(window="2026-03-15", days=7)

drifted, stat, pval = check_prediction_drift(ref, current)
if drifted:
    alert(f"Prediction drift detected: KS={stat:.3f}, p={pval:.4f}")
```

Layer 3: Business metric correlation. Connect model predictions to downstream business outcomes. If conversion drops, is it correlated with a shift in model confidence scores? This is where most teams stop because it requires cross-team coordination. It is also where the real value is.
The monitoring stack does not need to be complex. A Prometheus/Grafana setup with custom metrics for feature distributions and prediction stats gets you 80% of the way there. The missing 20% is discipline: someone has to look at the dashboards and act on alerts.
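The custom metrics behind such a dashboard can start as a handful of summary statistics per scoring window. A minimal sketch follows; the metric names and the 0.5 decision threshold are illustrative, and exporting the resulting values to Prometheus (for example as gauges) is omitted:

```python
import numpy as np

def compute_model_metrics(features, predictions, positive_threshold=0.5):
    """Summary stats for one scoring window: per-feature null rates and
    means, plus the share and mean of positive prediction scores."""
    metrics = {}
    for name, values in features.items():
        arr = np.asarray(values, dtype=float)
        metrics[f"feature_null_rate_{name}"] = float(np.mean(np.isnan(arr)))
        finite = arr[~np.isnan(arr)]
        metrics[f"feature_mean_{name}"] = (
            float(finite.mean()) if finite.size else float("nan")
        )
    preds = np.asarray(predictions, dtype=float)
    metrics["prediction_positive_rate"] = float(np.mean(preds >= positive_threshold))
    metrics["prediction_mean_score"] = float(preds.mean())
    return metrics
```

Chart these values over time and the failure modes from Layers 1 and 2, broken pipelines and shifting prediction distributions, become visible at a glance.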
The regulatory angle most teams ignore
If your ML model operates in the EU and qualifies as high-risk under the EU AI Act, monitoring is not optional. It is law.
Article 72 requires providers of high-risk AI systems to establish a post-market monitoring system that actively collects and analyzes performance data throughout the system's lifetime. Article 15 requires that high-risk systems "achieve an appropriate level of accuracy, robustness, and cybersecurity" and "perform consistently throughout their lifecycle."
That last part, "perform consistently throughout their lifecycle," is the EU telling you to detect drift.
The regulation also specifically addresses feedback loops: systems that continue learning after deployment must be designed to "eliminate or reduce the risk of possibly biased outputs influencing input for future operations." If your recommendation engine is reinforcing its own biases because nobody is monitoring the feedback cycle, that is a compliance problem.
The practical overlap between good MLOps and EU AI Act compliance is almost total:
| MLOps best practice | EU AI Act requirement |
|---|---|
| Data quality monitoring | Art. 10 (data governance) |
| Drift detection | Art. 15 (accuracy throughout lifecycle) |
| Model versioning and lineage | Art. 11 (technical documentation) |
| Performance dashboards | Art. 72 (post-market monitoring plan) |
| Incident response for model failures | Art. 73 (serious incident reporting) |
| Bias monitoring | Art. 15 (feedback loop prevention) |
Companies that already run proper MLOps are 80% compliant with the monitoring requirements. Companies that skip monitoring will need to build everything from scratch when compliance deadlines hit.
Start with these three things
If your models are in production with no monitoring today, do not try to build a full observability platform. Start small:
1. Log every prediction. Not just the output, but the input features, model version, timestamp, and confidence score. Store it somewhere queryable. This costs almost nothing and gives you the data for everything else.
2. Set up one drift alert. Pick your most important feature. Compute a rolling PSI or KS statistic against your training distribution. Alert when it crosses a threshold. One feature, one alert. Expand from there.
3. Track one business metric against model performance. Pick the metric your model is supposed to improve (conversion, fraud rate, churn). Plot it alongside model confidence distributions on the same dashboard. When they diverge, investigate.
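Step 1 can start as a single queryable table. A minimal sketch using SQLite; the schema and helper names are illustrative, and a real deployment would more likely write to a warehouse or an event stream:

```python
import json
import sqlite3
import time
import uuid

def init_store(path="predictions.db"):
    """Open (or create) the prediction log."""
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS predictions (
        id TEXT PRIMARY KEY, ts REAL, model_version TEXT,
        features TEXT, prediction REAL, confidence REAL)""")
    return conn

def log_prediction(conn, model_version, features, prediction, confidence):
    # Store input features as JSON so the schema survives feature changes
    conn.execute(
        "INSERT INTO predictions VALUES (?, ?, ?, ?, ?, ?)",
        (str(uuid.uuid4()), time.time(), model_version,
         json.dumps(features), prediction, confidence),
    )
    conn.commit()
```

With every prediction logged this way, the reference and current windows for any drift check are one SQL query away.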
These three steps take a senior ML engineer about a week to implement. Skipping them and hoping your model stays accurate forever is how you become the next Zillow.
We have been there
At DeviDevs, we build ML platforms where monitoring is part of the architecture from day one, not an afterthought bolted on after the first incident. We also help teams figure out which of their models qualify as high-risk under the EU AI Act, because the monitoring requirements for compliance are the same ones that prevent production failures.
If your models are running without monitoring, it is not a matter of if they will break. It is a matter of when you will notice.
About DeviDevs: We build ML platforms, secure AI systems, and help companies comply with the EU AI Act. devidevs.com