ML Experiment Tracking: Best Practices for Reproducible Machine Learning
Experiment tracking is the single most impactful MLOps practice you can adopt. It costs almost nothing to implement, immediately improves reproducibility, and forms the foundation for every other MLOps capability.
What to Track
The Minimum Viable Tracking Set
Every training run should capture:
```python
import sys
import platform

import mlflow
import mlflow.sklearn


def track_experiment(model, params, metrics, data_info, tags=None):
    """Comprehensive experiment tracking template."""
    with mlflow.start_run():
        # 1. Parameters (what you configured)
        mlflow.log_params(params)

        # 2. Metrics (what you measured)
        mlflow.log_metrics(metrics)

        # 3. Environment (where it ran)
        mlflow.set_tag("python_version", sys.version)
        mlflow.set_tag("platform", platform.platform())
        if "torch" in sys.modules:
            gpu_available = str(sys.modules["torch"].cuda.is_available())
        else:
            gpu_available = "N/A"
        mlflow.set_tag("gpu_available", gpu_available)

        # 4. Data (what it trained on)
        mlflow.set_tag("data_version", data_info["version"])
        mlflow.set_tag("data_rows", str(data_info["rows"]))
        mlflow.set_tag("data_features", str(data_info["features"]))

        # 5. Custom tags
        if tags:
            for key, value in tags.items():
                mlflow.set_tag(key, value)

        # 6. Model artifact
        mlflow.sklearn.log_model(model, "model")


# Usage
track_experiment(
    model=trained_model,
    params={"n_estimators": 200, "max_depth": 12, "learning_rate": 0.05},
    metrics={"accuracy": 0.931, "f1": 0.905, "auc_roc": 0.972, "train_time_seconds": 45.2},
    data_info={"version": "v2.4", "rows": 50000, "features": 24},
    tags={"author": "petru", "experiment_type": "hyperparameter_sweep"},
)
```

What People Forget to Log
| Often Missed | Why It Matters |
|--------------|----------------|
| Random seed | Reproducibility |
| Data preprocessing steps | Feature consistency |
| Train/test split method | Fair comparison |
| Evaluation dataset version | Metric comparability |
| Training duration | Cost estimation |
| GPU utilization | Infrastructure sizing |
| Failed runs | Learning from mistakes |
| Null handling strategy | Data quality understanding |
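Several of these often-missed items can be gathered by a small helper that collects them into one dict, which you would then pass to `mlflow.log_params()` alongside the usual hyperparameters. This is a minimal sketch; the helper name, field names, and `split_method` values are illustrative assumptions, not an MLflow API:

```python
import random
import time


def reproducibility_params(seed, split_method, test_size, preprocessing_steps):
    """Collect often-forgotten settings into one dict for logging.

    Setting the seed in the same place it is logged keeps the two
    from ever diverging. Field names here are illustrative.
    """
    random.seed(seed)
    return {
        "random_seed": seed,
        "split_method": split_method,  # e.g. "stratified", "time-based"
        "test_size": test_size,
        "preprocessing": " -> ".join(preprocessing_steps),
        "run_started_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }


params = reproducibility_params(
    seed=42,
    split_method="stratified",
    test_size=0.2,
    preprocessing_steps=["impute_median", "standard_scale", "one_hot_encode"],
)
print(params["preprocessing"])  # impute_median -> standard_scale -> one_hot_encode
```

Recording the preprocessing pipeline as an ordered string is a lightweight stand-in; if your pipeline object is serializable, logging it as an artifact is more faithful.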
Organizing Experiments
Naming Conventions
```python
# Bad: unnamed or generic experiments
mlflow.set_experiment("test")
mlflow.set_experiment("experiment_1")

# Good: structured naming
mlflow.set_experiment("churn-prediction/feature-expansion-v2")
mlflow.set_experiment("pricing-model/gpu-optimization")

# Pattern: {model-name}/{experiment-purpose}
```

Tagging Strategy
```python
# Standard tags for every run
STANDARD_TAGS = {
    "team": "ml-platform",
    "use_case": "customer-churn",
    "stage": "development",  # development | staging | production
    "data_source": "data-warehouse",
    "trigger": "manual",  # manual | scheduled | drift-triggered
}

# Run-specific tags
mlflow.set_tags(STANDARD_TAGS)
mlflow.set_tag("hypothesis", "Adding recency features improves churn prediction")
mlflow.set_tag("outcome", "confirmed: +2.1% F1 improvement")
```

Nested Runs for Hyperparameter Sweeps
```python
with mlflow.start_run(run_name="hp-sweep-2026-02-18"):
    mlflow.set_tag("sweep_method", "bayesian")
    mlflow.log_param("total_trials", 50)

    for trial in optimizer.get_trials(50):
        with mlflow.start_run(run_name=f"trial-{trial.number}", nested=True):
            mlflow.log_params(trial.params)
            model = train_with_params(trial.params)
            metrics = evaluate(model)
            mlflow.log_metrics(metrics)

    # Log best result at parent level
    best = optimizer.best_trial
    mlflow.log_params({f"best_{k}": v for k, v in best.params.items()})
    mlflow.log_metrics({f"best_{k}": v for k, v in best.metrics.items()})
```

Comparison Workflows
A/B Model Comparison
```python
def compare_models(run_ids: list[str], metric: str = "f1_weighted") -> dict:
    """Compare multiple experiment runs on a specific metric."""
    client = mlflow.tracking.MlflowClient()
    results = []
    for run_id in run_ids:
        run = client.get_run(run_id)
        results.append({
            "run_id": run_id,
            "run_name": run.info.run_name,
            "params": run.data.params,
            "metrics": run.data.metrics,
            "primary_metric": run.data.metrics.get(metric, 0),
        })
    results.sort(key=lambda x: x["primary_metric"], reverse=True)
    return {
        "best_run": results[0],
        "all_runs": results,
        "metric_used": metric,
        "improvement": results[0]["primary_metric"] - results[-1]["primary_metric"],
    }
```

Team Collaboration Patterns
Experiment Review Checklist
Before promoting a model from experiment to staging, verify:
```markdown
## Experiment Review Checklist

### Data
- [ ] Training data version documented
- [ ] Data validation passed
- [ ] No data leakage between train/test
- [ ] Class distribution is representative

### Model
- [ ] Hyperparameters documented
- [ ] Performance meets minimum thresholds
- [ ] No regression vs. current production
- [ ] Bias/fairness checks passed
- [ ] Model size within limits

### Reproducibility
- [ ] Random seed set and logged
- [ ] Environment captured (Python, packages)
- [ ] Pipeline can reproduce results from scratch
- [ ] Code committed and tagged
```

Advanced: Custom Metrics Logging
Logging Curves and Plots
```python
import matplotlib.pyplot as plt
import mlflow
from sklearn.metrics import confusion_matrix, roc_curve


def log_evaluation_plots(y_true, y_pred, y_proba):
    """Log evaluation plots as MLflow artifacts."""
    # ROC curve
    fpr, tpr, _ = roc_curve(y_true, y_proba)
    fig, ax = plt.subplots()
    ax.plot(fpr, tpr)
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    ax.set_title("ROC Curve")
    mlflow.log_figure(fig, "plots/roc_curve.png")
    plt.close(fig)

    # Confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    fig, ax = plt.subplots()
    ax.imshow(cm, cmap="Blues")
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, str(cm[i, j]), ha="center", va="center")
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")
    ax.set_title("Confusion Matrix")
    mlflow.log_figure(fig, "plots/confusion_matrix.png")
    plt.close(fig)

    # Feature importance (if available)
    # mlflow.log_artifact("feature_importance.json")
```

Step-Level Metrics for Training Curves
```python
for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, train_loader)
    val_loss, val_acc = evaluate(model, val_loader)

    # Logging at each step creates training curves in the MLflow UI
    mlflow.log_metric("train_loss", train_loss, step=epoch)
    mlflow.log_metric("val_loss", val_loss, step=epoch)
    mlflow.log_metric("val_accuracy", val_acc, step=epoch)
    mlflow.log_metric("learning_rate", optimizer.param_groups[0]["lr"], step=epoch)
```

Experiment Tracking Anti-Patterns
| Anti-Pattern | Problem | Better Approach |
|--------------|---------|-----------------|
| Tracking only successful runs | Can't learn from failures | Track everything, tag failures |
| Logging 100+ metrics per run | Noise drowns signal | Focus on 5-10 key metrics |
| No naming convention | Can't find experiments | Use structured `/` names |
| Logging to local files | Not shared, easily lost | Use a tracking server (MLflow/W&B) |
| Post-hoc logging | Metadata is wrong/incomplete | Log during training, not after |
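The first anti-pattern (tracking only successful runs) is easy to avoid with a small wrapper that records a run's status either way. A framework-agnostic sketch; `run_with_status` and the tag names are illustrative assumptions, and the returned tags dict is what you would pass to `mlflow.set_tags()` inside the run:

```python
import traceback


def run_with_status(train_fn):
    """Run training and return (result, tags) so failures are recorded too.

    On failure the exception details are captured as tags instead of
    the run being silently lost.
    """
    try:
        result = train_fn()
        return result, {"status": "success"}
    except Exception as exc:
        return None, {
            "status": "failed",
            "error_type": type(exc).__name__,
            "error_message": str(exc),
            "traceback_tail": traceback.format_exc().splitlines()[-1],
        }


# A failing run gets tagged rather than dropped
def bad_train():
    raise ValueError("learning rate too high: loss diverged")


result, tags = run_with_status(bad_train)
print(tags["status"], "-", tags["error_message"])
# failed - learning rate too high: loss diverged
```

Pairing this with a `status` filter in the MLflow UI makes failed experiments searchable, which is where most of the learning happens.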
Related Resources
- MLflow tutorial: Hands-on setup and usage
- MLOps best practices: Where tracking fits in the workflow
- What is MLOps?: Foundational concepts
- MLOps tools comparison: Choosing the right tracking tool
Need to set up experiment tracking for your team? DeviDevs implements production MLOps platforms with comprehensive tracking and collaboration. Get a free assessment →