MLOps for Small Teams: ML Infrastructure Without the Complexity
You don't need Kubeflow, a feature store, and a dedicated ML platform team to do MLOps well. Small teams of 1-5 ML engineers can build production-grade ML systems with a fraction of the infrastructure, if they pick the right tools and priorities.
The Minimal Viable MLOps Stack
Cost: $0-200/month (infrastructure only, all open source)
┌─────────────────────────────────────────────────────┐
│             Minimal Viable MLOps Stack              │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Git (code) + DVC (data)  →  MLflow (tracking)      │
│            │                        │               │
│            ▼                        ▼               │
│  GitHub Actions (CI/CD)   Model Registry (MLflow)   │
│            │                        │               │
│            ▼                        ▼               │
│  pytest (testing)         FastAPI (serving)         │
│                                     │               │
│                                     ▼               │
│                          Evidently (monitoring)     │
│                                                     │
│  Storage: S3 or GCS ($5-50/month)                   │
│  Compute: GitHub Actions (free tier) + cloud GPU    │
└─────────────────────────────────────────────────────┘
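Seen as dependencies, the stack above fits in a single requirements file; the version pins and the uvicorn entry (needed to actually serve the FastAPI app) are illustrative assumptions, not prescriptions:

```text
# requirements.txt, illustrative sketch of the minimal stack
mlflow>=2.0
dvc[s3]
fastapi
uvicorn
evidently
pandas
scikit-learn
pytest
```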
Week 1: The Foundation (Days 1-5)
Day 1: Experiment Tracking
The highest-impact practice. Get MLflow running in 10 minutes:
pip install mlflow
mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./artifacts --port 5000

Add autologging to your training code:
import mlflow
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("my-first-experiment")
mlflow.autolog()
# Your existing training code works unchanged, autolog captures everything
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

Days 2-3: Data Versioning
pip install dvc[s3]
dvc init
dvc remote add -d storage s3://my-ml-data/dvc
# Track your training data
dvc add data/training.parquet
git add data/training.parquet.dvc .gitignore
git commit -m "Track training data v1"
dvc push

Days 4-5: Basic CI/CD
# .github/workflows/ml.yml
name: ML Pipeline
on:
  push:
    branches: [main]
    paths: ['src/**', 'configs/**']

jobs:
  test-and-train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11', cache: 'pip' }
      - run: pip install -r requirements.txt
      - run: pytest tests/ -v
      - run: python src/train.py --config configs/production.yaml
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}

Result after Week 1: every experiment is tracked, your data is versioned, and training runs automatically on every push. Total cost: $0.
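The pytest step in the workflow assumes a test suite already exists. A minimal, stdlib-only data-schema check is one low-effort starting point; the file name and column names here are hypothetical:

```python
# tests/test_data.py, hypothetical schema smoke test (stdlib only)
REQUIRED_COLUMNS = {"user_id", "days_since_last_purchase", "purchases_last_30d"}

def missing_columns(columns: set) -> list:
    """Return the required columns absent from the dataset, sorted for stable output."""
    return sorted(REQUIRED_COLUMNS - columns)

def test_schema_complete():
    # A dataset with all required columns (plus extras) passes
    assert missing_columns({"user_id", "days_since_last_purchase",
                            "purchases_last_30d", "extra"}) == []

def test_schema_missing():
    # A dataset missing columns reports exactly which ones
    assert missing_columns({"user_id"}) == ["days_since_last_purchase",
                                            "purchases_last_30d"]
```

Even a test this small turns the CI step from a no-op into a real gate: bad data fails the pipeline before training starts.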
Month 1: The Road to Production (Weeks 2-4)
Add Model Serving
# serve.py, Simple FastAPI server
from fastapi import FastAPI
import mlflow.pyfunc
import pandas as pd
app = FastAPI()
model = mlflow.pyfunc.load_model("models:/my-model/Production")
@app.post("/predict")
async def predict(features: dict):
    df = pd.DataFrame([features])
    return {"prediction": float(model.predict(df)[0])}

Deploy with a single command:
# Option A: Railway/Render (simplest)
# Push code → auto-deploys, free tier available
# Option B: Docker
docker build -t model-api .
docker run -p 8080:8080 model-api

Add Basic Monitoring
# Weekly monitoring script (run via cron/GitHub Actions)
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
import pandas as pd
reference = pd.read_parquet("data/training.parquet")
current = pd.read_parquet("data/production_last_week.parquet")
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("reports/drift_report.html")
# Check if retraining needed
results = report.as_dict()
if results["metrics"][0]["result"]["dataset_drift"]:
    print("DRIFT DETECTED, consider retraining")
    # Send Slack notification here

Add Model Registry Promotion
# promote.py, Simple model promotion script
import mlflow
from mlflow.tracking import MlflowClient
client = MlflowClient()
def promote_best_model(experiment_name: str, metric: str = "f1", min_threshold: float = 0.85):
    experiment = client.get_experiment_by_name(experiment_name)
    runs = client.search_runs([experiment.experiment_id], order_by=[f"metrics.{metric} DESC"], max_results=1)
    if not runs:
        print("No runs found")
        return
    best_run = runs[0]
    best_metric = best_run.data.metrics.get(metric, 0)
    if best_metric < min_threshold:
        print(f"Best {metric}={best_metric:.3f} below threshold {min_threshold}")
        return
    model_uri = f"runs:/{best_run.info.run_id}/model"
    mv = mlflow.register_model(model_uri, "my-model")
    client.transition_model_version_stage("my-model", mv.version, "Production", archive_existing_versions=True)
    print(f"Promoted model v{mv.version} ({metric}={best_metric:.3f}) to Production")

Result after Month 1: the model is served via an API, with basic monitoring and automated promotion. Total cost: ~$50/month (hosting).
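The monitoring script above leaves the Slack alert as a comment. A stdlib-only sketch of that piece could look like this; the function names, and the idea of reading the webhook URL from an environment secret, are assumptions rather than part of the original setup:

```python
# Hypothetical Slack alert helper for the drift-monitoring script (stdlib only)
import json
import urllib.request

def build_slack_payload(text: str) -> bytes:
    # Slack incoming webhooks expect a JSON body of the form {"text": "..."}
    return json.dumps({"text": text}).encode("utf-8")

def notify_slack(webhook_url: str, text: str) -> None:
    # POST the payload to the incoming-webhook URL
    req = urllib.request.Request(
        webhook_url,
        data=build_slack_payload(text),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Usage inside the drift check, with the URL stored as a secret:
# notify_slack(os.environ["SLACK_WEBHOOK"], "DRIFT DETECTED, consider retraining")
```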
Month 3: Scaling Up
Once you outgrow the basics, add pieces incrementally:
When to Add a Feature Store
You need one when: several models share features, or you have training-serving skew. You don't need one when: a single model computes its features from raw input.
# Simple alternative: Feature computation module (shared code)
# src/features.py, used by both training AND serving
def compute_features(raw_data: dict) -> dict:
    return {
        "recency_score": 1.0 / (1.0 + raw_data["days_since_last_purchase"] / 30),
        "frequency_score": min(raw_data["purchases_last_30d"] / 10, 1.0),
        "monetary_score": min(raw_data["avg_order_value"] / 100, 1.0),
    }

When to Add Kubeflow/Airflow
You need it when: multiple pipeline stages with different dependencies, GPU scheduling, or complex DAGs. You don't need it when: a single training script handles everything end to end.
# Simple alternative: Makefile pipeline
# Makefile
.PHONY: pipeline
pipeline: validate features train evaluate promote

validate:
	python src/validate_data.py

features:
	python src/compute_features.py

train:
	python src/train.py --config configs/production.yaml

evaluate:
	python src/evaluate.py

promote:
	python src/promote.py

When to Add Auto-Retraining
You need it when: model performance degrades monthly, or the data changes frequently. You don't need it when: the model is retrained quarterly and performance is stable.
# GitHub Actions scheduled retraining
name: Weekly Retrain
on:
  schedule:
    - cron: '0 3 * * 0'  # Sunday 3 AM

jobs:
  retrain:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: make pipeline
      - name: Notify
        if: failure()
        run: curl -X POST $SLACK_WEBHOOK -d '{"text":"Retraining failed!"}'

Cost Comparison: Small Team vs. Enterprise
| Component | Small Team | Enterprise |
|-----------|------------|------------|
| Experiment tracking | MLflow (self-hosted): $0 | W&B: $50/user/month |
| Orchestration | GitHub Actions: $0 | Kubeflow + K8s: $500+/month |
| Feature store | Shared Python module: $0 | Tecton: $1,000+/month |
| Serving | Railway/Render: $7-25/month | KServe + K8s: $200+/month |
| Monitoring | Evidently + cron: $0 | Arize: $500+/month |
| Storage | S3: $5-50/month | S3 + data lake: $100+/month |
| Total | $12-75/month | $2,350+/month |
Common Small-Team Mistakes
- Starting with Kubernetes: you probably don't need it. Railway, Render, or a single VM covers most serving needs.
- Building a "platform": you don't need a platform team until you have 5+ models in production.
- Picking tools from blog posts: Netflix built its ML platform for Netflix-scale problems. You have different problems; start with what fits your team.
- Skipping experiment tracking: the one component you should never skip, even as a solo ML engineer. mlflow.autolog() takes a single line.
- A premature feature store: a shared Python module imported by both training and serving is a perfectly valid "feature store" for a small team.
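That last point is easy to enforce: because both paths call the same function, a short unit test guards against training-serving skew. A self-contained sketch, re-declaring the illustrative compute_features from earlier so it runs standalone:

```python
# Illustrative skew guard; compute_features repeats the hypothetical
# src/features.py module shown earlier so this file is self-contained
def compute_features(raw_data: dict) -> dict:
    return {
        "recency_score": 1.0 / (1.0 + raw_data["days_since_last_purchase"] / 30),
        "frequency_score": min(raw_data["purchases_last_30d"] / 10, 1.0),
        "monetary_score": min(raw_data["avg_order_value"] / 100, 1.0),
    }

def test_features_identical_for_training_and_serving():
    # Training and serving import this one function, so a pinned
    # input/output pair catches any accidental divergence in either path
    raw = {"days_since_last_purchase": 30, "purchases_last_30d": 5, "avg_order_value": 50}
    assert compute_features(raw) == {
        "recency_score": 0.5,
        "frequency_score": 0.5,
        "monetary_score": 0.5,
    }
```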
The Small-Team MLOps Roadmap
Month 1: MLflow + DVC + GitHub Actions + FastAPI = production
Month 2: add Evidently monitoring + model promotion
Month 3: add scheduled retraining + alerting
Month 6: reassess: do we need Kubeflow or a feature store?
Month 12: scale only what has become a bottleneck
Related Resources
- What is MLOps?: a complete overview
- MLflow tutorial: setup from scratch
- Common MLOps mistakes: what to avoid
- MLOps tool comparison: the full tooling landscape
- From notebook to production: a migration guide
Building ML with a small team? DeviDevs helps teams of every size build right-sized MLOps infrastructure. No over-engineering, just what you need. Request a free assessment →
Is your AI system EU AI Act compliant? Free risk assessment - find out in 2 minutes →