Your ML Model Registry Is Your EU AI Act Compliance Foundation
The Question That Breaks Compliance Audits
Here is a question that will ruin your day: which exact model version is running in production right now?
Not "the latest one." Not "the one we trained last month." The exact version, with its training data snapshot, hyperparameters, evaluation metrics, and the person who approved it for deployment.
If you cannot answer that in under 60 seconds, you have a compliance problem. And with the EU AI Act's Annex IV technical documentation requirements enforceable for high-risk AI systems, that compliance problem has a deadline attached to it.
Annex IV Section 6 requires "a description of relevant changes made by the provider to the system throughout its lifecycle." Every algorithm update, every retraining, every architecture modification, every change in training data. Documented. Versioned. Auditable.
Most ML teams I have talked to cannot even tell you how many models they have in production, let alone trace each one back to its training run.
What Annex IV Actually Demands From Your MLOps Stack
The EU AI Act does not mention "model registry" anywhere in its text. But Annex IV's nine mandatory documentation sections describe exactly what a model registry does. Let me map it out:
Section 2 (Development and Design) requires you to document "design specifications, the general logic of the system and the algorithms used," plus "training, validation and test datasets: provenance, characteristics, collection process, preparation, labelling, cleaning and governance measures" per Article 10.
Translation: you need data lineage from source to model output.
Section 4 (Performance Metrics) requires you to describe "the appropriateness of the performance metrics chosen" and provide test results including bias evaluations across population subgroups.
Translation: you need versioned evaluation results tied to specific model artifacts.
Section 6 (Lifecycle Changes) requires documentation of every significant change: algorithm updates, retraining runs, architecture modifications, training data changes.
Translation: you need a changelog tied to model versions.
Section 9 (Post-Market Monitoring) per Article 72 requires a monitoring plan including performance drift detection and incident reporting.
Translation: you need production monitoring linked back to model versions so you can correlate a drift alert with the exact artifact that is drifting.
Here is what that looks like in practice with MLflow:
```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register model with full lineage
with mlflow.start_run() as run:
    # Log training parameters (Annex IV Section 2)
    mlflow.log_params({
        "algorithm": "xgboost",
        "training_data_version": "v2.3.1",
        "training_data_hash": "sha256:a1b2c3d4...",
        "feature_set": "credit_scoring_v4",
        "preprocessing_pipeline": "v1.2.0"
    })

    # Log evaluation metrics (Annex IV Section 4)
    mlflow.log_metrics({
        "accuracy": 0.94,
        "f1_score": 0.91,
        "demographic_parity_ratio": 0.96,
        "equalized_odds_diff": 0.03
    })

    # Log model artifact (`model` is your trained estimator)
    mlflow.sklearn.log_model(model, "credit_model")

    # Register with governance metadata (Annex IV Section 6)
    model_version = mlflow.register_model(
        f"runs:/{run.info.run_id}/credit_model",
        "CreditScoringModel"
    )
    client.set_model_version_tag(
        "CreditScoringModel",
        model_version.version,
        "approved_by", "petru.constantin"
    )
    client.set_model_version_tag(
        "CreditScoringModel",
        model_version.version,
        "risk_classification", "high_risk_annex_iii"
    )
```

That is Annex IV Sections 2, 4, and 6 in about 30 lines of Python. The model, its training data hash, its evaluation metrics, its bias measurements, and the person who approved it. All versioned, all queryable, all auditable.
The Gap Between "We Use MLflow" and "We Are Compliant"
Running MLflow (or Weights & Biases, or SageMaker Model Registry, or any other tool) does not automatically make you compliant. Most teams use their model registry as glorified file storage. They log the model artifact and maybe some metrics, then never look at them again.
Compliance requires three things most MLOps setups miss:
1. Promotion gates with approval workflows. A model cannot go from staging to production without documented human approval. Article 14 requires human oversight, and that starts with someone explicitly approving a model for deployment. Not a CI/CD pipeline auto-promoting because accuracy went up 0.2%.
```yaml
# Example promotion gate in CI/CD
stages:
  - name: staging
    requires:
      - bias_evaluation_passed: true
      - performance_above_threshold: true
      - data_drift_check: passed
  - name: production
    requires:
      - staging_validation: 7_days_minimum
      - human_approval: required
      - risk_assessment_updated: true
      - annex_iv_documentation: complete
```

2. Training data versioning, not just model versioning. Section 2 of Annex IV explicitly requires training data provenance, collection process, and governance measures. If you retrain a model but cannot point to the exact dataset version used, your Annex IV documentation has a gap. Tools like DVC or lakeFS solve this, but most teams skip it because "we just use the latest data."
3. Continuous monitoring tied to model versions. Section 9 requires post-market monitoring with performance drift detection. When your monitoring dashboard shows accuracy dropping, can you immediately identify which model version is affected and what changed since its last evaluation? If your monitoring and your model registry are separate systems with no link between them, you are flying blind.
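The link can be as simple as resolving an alert to a registry record. A minimal sketch using an in-memory stand-in for the registry (the alert and record shapes here are assumptions; in a real setup the lookup would hit `MlflowClient` instead of a dict):

```python
from dataclasses import dataclass, field


@dataclass
class ModelRecord:
    """Stand-in for a registry entry (MLflow would store this for you)."""
    name: str
    version: str
    run_id: str
    training_data_version: str
    metrics: dict = field(default_factory=dict)


# Registry keyed by (model name, stage): "which version is in production"
# becomes a single queryable fact, not tribal knowledge.
REGISTRY = {
    ("CreditScoringModel", "production"): ModelRecord(
        name="CreditScoringModel",
        version="7",
        run_id="run-123",
        training_data_version="v2.3.1",
        metrics={"accuracy": 0.94},
    ),
}


def resolve_drift_alert(alert: dict) -> ModelRecord:
    """Map a monitoring alert to the exact model version it concerns."""
    return REGISTRY[(alert["model_name"], alert["stage"])]


record = resolve_drift_alert(
    {"model_name": "CreditScoringModel", "stage": "production"}
)
```

With MLflow the same resolution is one call per hop: `get_model_version` for the artifact, then `get_run` on its run ID for the training parameters and evaluation metrics.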
The 10-Year Retention Problem
Here is one most people miss: Article 18 requires you to retain technical documentation for 10 years after a high-risk AI system is placed on the market.
Ten years. Your model registry needs to survive team turnover, infrastructure migrations, and probably at least two "let's rewrite everything" cycles. That MLflow instance running on a single EC2 box? Not going to cut it.
This is where MLOps maturity actually matters for compliance:
- Model artifacts need durable storage (S3 with versioning, not a local filesystem)
- Metadata and lineage need a database that gets backed up and migrated, not an SQLite file
- Access logs showing who approved what and when need to be immutable
If you are a startup deploying high-risk AI, this is the moment to pick infrastructure that will outlast your current DevOps engineer's tenure.
What To Do This Week
If you are deploying AI systems that might fall under Annex III high-risk categories (credit scoring, HR screening, law enforcement, critical infrastructure), here is the minimum you need:
- Audit your model inventory. List every model in production. If you cannot list them, that is your first problem.
- Check your lineage. For each model, can you trace back to: training data version, training code commit, evaluation results, and who approved deployment? If any link is broken, fix it.
- Add promotion gates. No model goes to production without documented approval. Automate the checks (bias, performance, data drift), but keep the final approval human.
- Set up versioned training data. DVC, lakeFS, or even a disciplined S3 bucket with immutable snapshots. The tool matters less than the habit.
- Connect monitoring to your registry. When you get a drift alert, you should be one click away from the model version, its training run, and its evaluation results.
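The lineage check in the second step can be automated. A minimal sketch, given the tags you log per model version (the required-tag list is an assumption; adapt it to your own schema):

```python
REQUIRED_LINEAGE_TAGS = (
    "training_data_version",
    "training_code_commit",
    "evaluation_run_id",
    "approved_by",
)


def missing_lineage(tags: dict) -> list:
    """Return the lineage fields a model version is missing.

    An empty result means the version is traceable end to end; anything
    else is a documentation gap to fix before an audit.
    """
    return [t for t in REQUIRED_LINEAGE_TAGS if not tags.get(t)]


# A version tagged only at training time fails the check:
gaps = missing_lineage({"training_data_version": "v2.3.1"})
```

Run this over the output of `client.search_model_versions(...)` and you have a one-file audit report for the whole registry.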
None of this requires buying a $50K governance platform. MLflow 3.0 handles model registry, experiment tracking, and now GenAI agent tracing out of the box. DVC handles data versioning. Prometheus plus Grafana handles monitoring. The tools are free. The discipline is what costs you.
The Compliance Advantage Is an Engineering Advantage
Here is the thing nobody in the compliance world will tell you: everything Annex IV demands is just good MLOps practice. Model versioning, data lineage, evaluation tracking, promotion gates, monitoring. Teams that do this well ship better models faster, catch regressions earlier, and debug production issues in minutes instead of days.
The EU AI Act did not invent these requirements. It just made them mandatory for high-risk systems. If your MLOps stack already does this, your compliance cost is mostly paperwork. If it does not, the regulation is giving you a reason to build the infrastructure you should have built anyway.
At DeviDevs, we have helped teams go from "we have models somewhere in S3" to fully traceable, audit-ready ML pipelines. The engineering work is the same whether you are doing it for compliance or for sanity. The regulation just adds a deadline.
About DeviDevs: We build ML platforms, secure AI systems, and help companies comply with the EU AI Act. devidevs.com