MLOps

ML Experiment Tracking: Best Practices for Reproducible Machine Learning

Petru Constantin
5 min read
Tags: experiment tracking, reproducible ML, MLOps, MLflow, Weights & Biases, ML experiments

Experiment tracking is the single most impactful MLOps practice you can adopt. It costs almost nothing to implement, immediately improves reproducibility, and forms the foundation for every other MLOps capability.

What to Track

The Minimum Viable Tracking Set

Every training run should capture:

import mlflow
from datetime import datetime
import platform
import sys
 
def track_experiment(model, params, metrics, data_info, tags=None):
    """Comprehensive experiment tracking template."""
    with mlflow.start_run():
        # 1. Parameters (what you configured)
        mlflow.log_params(params)
 
        # 2. Metrics (what you measured)
        mlflow.log_metrics(metrics)
 
        # 3. Environment (where it ran)
        mlflow.set_tag("python_version", sys.version)
        mlflow.set_tag("platform", platform.platform())
        # Access torch through sys.modules so this works even when torch isn't imported here
        mlflow.set_tag("gpu_available", str(sys.modules["torch"].cuda.is_available()) if "torch" in sys.modules else "N/A")
 
        # 4. Data (what it trained on)
        mlflow.set_tag("data_version", data_info["version"])
        mlflow.set_tag("data_rows", str(data_info["rows"]))
        mlflow.set_tag("data_features", str(data_info["features"]))
 
        # 5. Custom tags
        if tags:
            for key, value in tags.items():
                mlflow.set_tag(key, value)
 
        # 6. Model artifact
        mlflow.sklearn.log_model(model, "model")
 
# Usage
track_experiment(
    model=trained_model,
    params={"n_estimators": 200, "max_depth": 12, "learning_rate": 0.05},
    metrics={"accuracy": 0.931, "f1": 0.905, "auc_roc": 0.972, "train_time_seconds": 45.2},
    data_info={"version": "v2.4", "rows": 50000, "features": 24},
    tags={"author": "petru", "experiment_type": "hyperparameter_sweep"},
)

What People Forget to Log

| Often Missed | Why It Matters |
|--------------|----------------|
| Random seed | Reproducibility |
| Data preprocessing steps | Feature consistency |
| Train/test split method | Fair comparison |
| Evaluation dataset version | Metric comparability |
| Training duration | Cost estimation |
| GPU utilization | Infrastructure sizing |
| Failed runs | Learning from mistakes |
| Null handling strategy | Data quality understanding |
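The random seed in particular takes one line to capture. A minimal sketch (the helper name and returned keys are illustrative, not an MLflow API):

```python
import random

def seed_everything(seed: int = 42) -> dict:
    """Seed the RNGs in use and return the values to log with the run."""
    random.seed(seed)
    # Seed numpy / torch too when they are in use, e.g.:
    #   np.random.seed(seed); torch.manual_seed(seed)
    return {"random_seed": seed}

# The returned dict plugs straight into the params you already log:
# mlflow.log_params({**seed_everything(42), "split_method": "stratified_80_20"})
```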

Organizing Experiments

Naming Conventions

# Bad: Unnamed or generic experiments
mlflow.set_experiment("test")
mlflow.set_experiment("experiment_1")
 
# Good: Structured naming
mlflow.set_experiment("churn-prediction/feature-expansion-v2")
mlflow.set_experiment("pricing-model/gpu-optimization")
 
# Pattern: {model-name}/{experiment-purpose}
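The pattern is easy to enforce with a tiny helper (hypothetical, not part of MLflow):

```python
def experiment_name(model_name: str, purpose: str) -> str:
    """Build a '{model-name}/{experiment-purpose}' name, normalized to kebab-case."""
    def kebab(s: str) -> str:
        return "-".join(s.lower().split())
    return f"{kebab(model_name)}/{kebab(purpose)}"

# mlflow.set_experiment(experiment_name("Churn Prediction", "feature expansion v2"))
# -> "churn-prediction/feature-expansion-v2"
```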

Tagging Strategy

# Standard tags for every run
STANDARD_TAGS = {
    "team": "ml-platform",
    "use_case": "customer-churn",
    "stage": "development",  # development | staging | production
    "data_source": "data-warehouse",
    "trigger": "manual",  # manual | scheduled | drift-triggered
}
 
# Run-specific tags
mlflow.set_tags(STANDARD_TAGS)
mlflow.set_tag("hypothesis", "Adding recency features improves churn prediction")
mlflow.set_tag("outcome", "confirmed: +2.1% F1 improvement")

Nested Runs for Hyperparameter Sweeps

with mlflow.start_run(run_name="hp-sweep-2026-02-18"):
    mlflow.set_tag("sweep_method", "bayesian")
    mlflow.log_param("total_trials", 50)
 
    for trial in optimizer.get_trials(50):
        with mlflow.start_run(run_name=f"trial-{trial.number}", nested=True):
            mlflow.log_params(trial.params)
            model = train_with_params(trial.params)
            metrics = evaluate(model)
            mlflow.log_metrics(metrics)
 
    # Log best result at parent level
    best = optimizer.best_trial
    mlflow.log_params({f"best_{k}": v for k, v in best.params.items()})
    mlflow.log_metrics({f"best_{k}": v for k, v in best.metrics.items()})

Comparison Workflows

A/B Model Comparison

def compare_models(run_ids: list[str], metric: str = "f1_weighted") -> dict:
    """Compare multiple experiment runs on a specific metric."""
    client = mlflow.tracking.MlflowClient()
    results = []
 
    for run_id in run_ids:
        run = client.get_run(run_id)
        results.append({
            "run_id": run_id,
            "run_name": run.info.run_name,
            "params": run.data.params,
            "metrics": run.data.metrics,
            "primary_metric": run.data.metrics.get(metric, 0),
        })
 
    results.sort(key=lambda x: x["primary_metric"], reverse=True)
 
    return {
        "best_run": results[0],
        "all_runs": results,
        "metric_used": metric,
        "improvement": results[0]["primary_metric"] - results[-1]["primary_metric"],
    }
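A small formatter (hypothetical helper, not an MLflow API) makes the comparison result readable in logs or CI output:

```python
def format_comparison(result: dict) -> str:
    """Render a compare_models() result as a short plain-text report."""
    lines = [f"Metric: {result['metric_used']}"]
    for r in result["all_runs"]:
        marker = "  <- best" if r is result["best_run"] else ""
        lines.append(f"  {r['run_name']}: {r['primary_metric']:.4f}{marker}")
    lines.append(f"Spread (best - worst): {result['improvement']:.4f}")
    return "\n".join(lines)
```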

Team Collaboration Patterns

Experiment Review Checklist

Before promoting a model from experiment to staging, verify:

## Experiment Review Checklist
 
### Data
- [ ] Training data version documented
- [ ] Data validation passed
- [ ] No data leakage between train/test
- [ ] Class distribution is representative
 
### Model
- [ ] Hyperparameters documented
- [ ] Performance meets minimum thresholds
- [ ] No regression vs. current production
- [ ] Bias/fairness checks passed
- [ ] Model size within limits
 
### Reproducibility
- [ ] Random seed set and logged
- [ ] Environment captured (Python, packages)
- [ ] Pipeline can reproduce results from scratch
- [ ] Code committed and tagged
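The environment items on this checklist can be captured automatically at run start. A sketch, assuming the training process starts inside a git checkout:

```python
import platform
import subprocess
import sys

def capture_environment() -> dict:
    """Collect the environment facts the checklist asks for, ready for mlflow.set_tags."""
    env = {
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
    }
    # Record the exact code version; fall back gracefully outside a repo
    try:
        env["git_commit"] = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], stderr=subprocess.DEVNULL, text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        env["git_commit"] = "unknown"
    return env

# mlflow.set_tags(capture_environment())
```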

Advanced: Custom Metrics Logging

Logging Curves and Plots

import mlflow
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, precision_recall_curve, confusion_matrix
import numpy as np
 
def log_evaluation_plots(y_true, y_pred, y_proba):
    """Log evaluation plots as MLflow artifacts."""
 
    # ROC Curve
    fpr, tpr, _ = roc_curve(y_true, y_proba)
    fig, ax = plt.subplots()
    ax.plot(fpr, tpr)
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    ax.set_title("ROC Curve")
    mlflow.log_figure(fig, "plots/roc_curve.png")
    plt.close()
 
    # Confusion Matrix
    cm = confusion_matrix(y_true, y_pred)
    fig, ax = plt.subplots()
    ax.imshow(cm, cmap="Blues")
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, str(cm[i, j]), ha="center", va="center")
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")
    ax.set_title("Confusion Matrix")
    mlflow.log_figure(fig, "plots/confusion_matrix.png")
    plt.close()
 
    # Precision-Recall Curve (precision_recall_curve is imported above)
    precision, recall, _ = precision_recall_curve(y_true, y_proba)
    fig, ax = plt.subplots()
    ax.plot(recall, precision)
    ax.set_xlabel("Recall")
    ax.set_ylabel("Precision")
    ax.set_title("Precision-Recall Curve")
    mlflow.log_figure(fig, "plots/pr_curve.png")
    plt.close(fig)
 
    # Feature Importance (if available)
    # mlflow.log_artifact("feature_importance.json")

Step-Level Metrics for Training Curves

for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, train_loader)
    val_loss, val_acc = evaluate(model, val_loader)
 
    # Logging with step=epoch creates training curves in the MLflow UI
    mlflow.log_metric("train_loss", train_loss, step=epoch)
    mlflow.log_metric("val_loss", val_loss, step=epoch)
    mlflow.log_metric("val_accuracy", val_acc, step=epoch)
    mlflow.log_metric("learning_rate", optimizer.param_groups[0]["lr"], step=epoch)

Experiment Tracking Anti-Patterns

| Anti-Pattern | Problem | Better Approach |
|--------------|---------|-----------------|
| Tracking only successful runs | Can't learn from failures | Track everything, tag failures |
| Logging 100+ metrics per run | Noise drowns signal | Focus on 5-10 key metrics |
| No naming convention | Can't find experiments | Use structured `{model}/{purpose}` names |
| Logging to local files | Not shared, easily lost | Use a tracking server (MLflow/W&B) |
| Post-hoc logging | Metadata is wrong/incomplete | Log during training, not after |
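The first anti-pattern is the easiest to fix: wrap training so failures get tagged instead of lost. A sketch; `set_tag` is injected (in practice `mlflow.set_tag`) so the pattern can be exercised without a tracking server:

```python
def track_outcome(train_fn, set_tag):
    """Call train_fn and tag the run with its outcome instead of discarding failures."""
    try:
        result = train_fn()
        set_tag("status", "completed")
        return result
    except Exception as exc:
        # Tag the failure so the run stays searchable, then re-raise
        set_tag("status", "failed")
        set_tag("failure_reason", f"{type(exc).__name__}: {exc}")
        raise

# With MLflow, inside a run:
#   with mlflow.start_run():
#       track_outcome(lambda: train(params), mlflow.set_tag)
```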


Need to set up experiment tracking for your team? DeviDevs implements production MLOps platforms with comprehensive tracking and collaboration. Get a free assessment →

Need help with EU AI Act compliance or AI security?

Schedule a free 30-minute consultation. No obligations.

Schedule a Call
