MLOps for Small Teams: Building ML Infrastructure Without the Complexity
You don't need Kubeflow, a feature store, and a dedicated ML platform team to do MLOps well. Small teams with 1-5 ML engineers can build production-grade ML systems with a fraction of the infrastructure — if they choose the right tools and priorities.
The Minimal Viable MLOps Stack
Cost: $0-200/month for infrastructure; every tool in the stack is open source.
┌────────────────────────────────────────────────────────┐
│               Minimal Viable MLOps Stack               │
├────────────────────────────────────────────────────────┤
│                                                        │
│  Git (code) + DVC (data)  →  MLflow (tracking)         │
│             │                        │                 │
│             ▼                        ▼                 │
│  GitHub Actions (CI/CD)      Model Registry (MLflow)   │
│             │                        │                 │
│             ▼                        ▼                 │
│  pytest (testing)            FastAPI (serving)         │
│                                      │                 │
│                                      ▼                 │
│                              Evidently (monitoring)    │
│                                                        │
│  Storage: S3 or GCS ($5-50/month)                      │
│  Compute: GitHub Actions (free tier) + cloud GPU       │
└────────────────────────────────────────────────────────┘
Week 1: Foundation (Day 1-5)
Day 1: Experiment Tracking
The single highest-impact practice. Set up MLflow in 10 minutes:
pip install mlflow
mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./artifacts --port 5000
Add autologging to your training code:
import mlflow
from sklearn.ensemble import RandomForestClassifier
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("my-first-experiment")
mlflow.autolog()
# Your existing training code works unchanged — autolog captures everything
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
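Everything autolog records is queryable afterwards. A small sketch that pulls the tracked runs back as a pandas DataFrame (it assumes the tracking server and experiment name set up above):
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")

# One row per run; columns include run_id plus the params.* and metrics.* that autolog captured.
runs = mlflow.search_runs(experiment_names=["my-first-experiment"])
print(runs[[c for c in runs.columns if c == "run_id" or c.startswith("metrics.")]].head())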
Day 2-3: Data Versioning
pip install dvc[s3]
dvc init
dvc remote add -d storage s3://my-ml-data/dvc
# Track your training data
dvc add data/training.parquet
git add data/training.parquet.dvc .gitignore
git commit -m "Track training data v1"
dvc push
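Later, a training script or CI job can pull back an exact version of the data by Git revision. A minimal sketch using dvc.api (the repo URL and rev are placeholders; omit repo to use the current checkout):
import dvc.api
import pandas as pd

# Reads data/training.parquet exactly as it existed at the given Git revision.
with dvc.api.open(
    "data/training.parquet",
    repo="https://github.com/your-org/your-repo",  # placeholder
    rev="v1.0",                                    # placeholder tag or commit
    mode="rb",
) as f:
    train_df = pd.read_parquet(f)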
Day 4-5: Basic CI/CD
# .github/workflows/ml.yml
name: ML Pipeline
on:
  push:
    branches: [main]
    paths: ['src/**', 'configs/**']
jobs:
  test-and-train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11', cache: 'pip' }
      - run: pip install -r requirements.txt
      - run: pytest tests/ -v
      - run: python src/train.py --config configs/production.yaml
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
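The pytest step assumes there is at least one fast sanity test under tests/. A minimal sketch of one, using synthetic data so it stays hermetic (the file name tests/test_train_smoke.py and the 0.7 threshold are illustrative):
# tests/test_train_smoke.py: fast sanity check run by CI before training
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def test_model_trains_and_beats_chance():
    # Synthetic stand-in for real training data keeps the test fast and deterministic.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)

    assert model.predict(X_test).shape == y_test.shape
    assert model.score(X_test, y_test) > 0.7  # comfortably above the ~0.5 chance level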
Result after Week 1: Every experiment is tracked, data is versioned, and training runs automatically on push. Total cost: $0.
Month 1: Production Path (Week 2-4)
Add Model Serving
# serve.py — Simple FastAPI server
from fastapi import FastAPI
import mlflow.pyfunc
import pandas as pd
app = FastAPI()
model = mlflow.pyfunc.load_model("models:/my-model/Production")
@app.post("/predict")
async def predict(features: dict):
    df = pd.DataFrame([features])
    return {"prediction": float(model.predict(df)[0])}
Deploy with a single command:
# Option A: Railway/Render (simplest)
# Push code → auto-deploys, free tier available
# Option B: Docker
docker build -t model-api .
docker run -p 8080:8080 model-api
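A quick smoke test against the running container; the feature names in the payload are placeholders for whatever inputs your model actually expects:
import requests

# Hypothetical input; replace the keys with your model's real feature names.
payload = {"days_since_last_purchase": 12, "purchases_last_30d": 3, "avg_order_value": 48.5}

resp = requests.post("http://localhost:8080/predict", json=payload, timeout=5)
resp.raise_for_status()
print(resp.json())  # {"prediction": <float>}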
Add Basic Monitoring
# Weekly monitoring script (run via cron/GitHub Actions)
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
import pandas as pd
reference = pd.read_parquet("data/training.parquet")
current = pd.read_parquet("data/production_last_week.parquet")
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("reports/drift_report.html")
# Check if retraining needed
results = report.as_dict()
if results["metrics"][0]["result"]["dataset_drift"]:
print("DRIFT DETECTED — consider retraining")
# Send Slack notificationAdd Model Registry Promotion
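The notification itself can be a plain webhook call. A minimal sketch, assuming the Slack incoming-webhook URL is available as a SLACK_WEBHOOK_URL environment variable:
import os
import requests

# Post a short alert to Slack; the webhook URL is an assumption, store it as a secret.
requests.post(
    os.environ["SLACK_WEBHOOK_URL"],
    json={"text": "Data drift detected: review the drift report and consider retraining."},
    timeout=10,
)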
Add Model Registry Promotion
# promote.py — Simple model promotion script
import mlflow
from mlflow.tracking import MlflowClient
client = MlflowClient()
def promote_best_model(experiment_name: str, metric: str = "f1", min_threshold: float = 0.85):
    experiment = client.get_experiment_by_name(experiment_name)
    runs = client.search_runs([experiment.experiment_id], order_by=[f"metrics.{metric} DESC"], max_results=1)
    if not runs:
        print("No runs found")
        return
    best_run = runs[0]
    best_metric = best_run.data.metrics.get(metric, 0)
    if best_metric < min_threshold:
        print(f"Best {metric}={best_metric:.3f} below threshold {min_threshold}")
        return
    model_uri = f"runs:/{best_run.info.run_id}/model"
    mv = mlflow.register_model(model_uri, "my-model")
    client.transition_model_version_stage("my-model", mv.version, "Production", archive_existing_versions=True)
    print(f"Promoted model v{mv.version} ({metric}={best_metric:.3f}) to Production")

if __name__ == "__main__":
    promote_best_model("my-first-experiment")  # experiment name from the Week 1 setup
Result after Month 1: Model served via API, basic monitoring, automated promotion. Total cost: ~$50/month (hosting).
Month 3: Scaling Up
When you outgrow the basics, add these incrementally:
When to Add a Feature Store
You need one when: Multiple models share features, or you have training-serving skew. You don't need one when: Single model, features computed from raw input.
# Simple alternative: Feature computation module (shared code)
# src/features.py — used by both training AND serving
def compute_features(raw_data: dict) -> dict:
    return {
        "recency_score": 1.0 / (1.0 + raw_data["days_since_last_purchase"] / 30),
        "frequency_score": min(raw_data["purchases_last_30d"] / 10, 1.0),
        "monetary_score": min(raw_data["avg_order_value"] / 100, 1.0),
    }
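Because training and serving call the same function, there is no skew to debug. A sketch of the serving side, assuming the module lives at src/features.py and the model is registered as in the promotion script above:
import mlflow.pyfunc
import pandas as pd
from src.features import compute_features

# Identical transformation at serving time and training time.
model = mlflow.pyfunc.load_model("models:/my-model/Production")
raw = {"days_since_last_purchase": 12, "purchases_last_30d": 3, "avg_order_value": 48.5}
df = pd.DataFrame([compute_features(raw)])
print(float(model.predict(df)[0]))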
When to Add Kubeflow/Airflow
You need it when: Multiple pipeline stages with different dependencies, GPU scheduling, or complex DAGs. You don't need it when: A single training script handles everything end-to-end.
# Simple alternative: Makefile pipeline
# Makefile
.PHONY: pipeline validate features train evaluate promote
pipeline: validate features train evaluate promote
validate:
	python src/validate_data.py
features:
	python src/compute_features.py
train:
	python src/train.py --config configs/production.yaml
evaluate:
	python src/evaluate.py
promote:
	python src/promote.py
When to Add Auto-Retraining
You need it when: Model performance degrades monthly, or data changes frequently. You don't need it when: Model is retrained quarterly and performance is stable.
# GitHub Actions scheduled retraining
name: Weekly Retrain
on:
  schedule:
    - cron: '0 3 * * 0'  # Sunday 3 AM
jobs:
  retrain:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: make pipeline
      - name: Notify
        if: failure()
        run: curl -X POST -H 'Content-type: application/json' -d '{"text":"Retraining failed!"}' "$SLACK_WEBHOOK"
        env:
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
Cost Comparison: Small Team vs. Enterprise
| Component | Small Team | Enterprise |
|-----------|------------|------------|
| Experiment tracking | MLflow (self-hosted) — $0 | W&B — $50/user/mo |
| Orchestration | GitHub Actions — $0 | Kubeflow + K8s — $500+/mo |
| Feature store | Shared Python module — $0 | Tecton — $1000+/mo |
| Serving | Railway/Render — $7-25/mo | KServe + K8s — $200+/mo |
| Monitoring | Evidently + cron — $0 | Arize — $500+/mo |
| Storage | S3 — $5-50/mo | S3 + data lake — $100+/mo |
| Total | $12-75/mo | $2,350+/mo |
Common Mistakes for Small Teams
- Starting with Kubernetes — You probably don't need it. Railway, Render, or a single VM works fine for most serving needs.
- Building a "platform" — You don't need a platform team until you have 5+ models in production.
- Choosing tools based on blog posts — Netflix built their ML platform for Netflix-scale problems. You have different problems. Start with what fits your team.
- Skipping experiment tracking — This is the one thing you should never skip, even as a solo ML engineer. mlflow.autolog() takes one line.
- Premature feature store — A shared Python module that both training and serving import is a perfectly valid "feature store" for a small team.
The Small Team MLOps Roadmap
Month 1: MLflow + DVC + GitHub Actions + FastAPI = Production
Month 2: Add Evidently monitoring + model promotion
Month 3: Add scheduled retraining + alerting
Month 6: Evaluate: Do we need Kubeflow/Feature store?
Month 12: Scale only what's bottlenecking you
Related Resources
- What is MLOps? — Complete overview
- MLflow tutorial — Day 1 setup
- Common MLOps mistakes — What to avoid
- MLOps tools comparison — Full tool landscape
- Notebook to production — Migration guide
Building ML with a small team? DeviDevs helps teams of all sizes build right-sized MLOps infrastructure. No over-engineering, just what you need. Get a free assessment →