
MLOps for small teams: practical infrastructure

Petru Constantin
6 min read
#MLOps #small-teams #ML-infrastructure #startup-ML #practical-MLOps #lean-MLOps

MLOps for Small Teams: ML Infrastructure Without the Complexity

You don't need Kubeflow, a feature store, and a dedicated ML platform team to do MLOps well. Small teams of 1-5 ML engineers can build production-grade ML systems with a fraction of the infrastructure, provided they choose the right tools and priorities.

The Minimal Viable MLOps Stack

Cost: $0-200/month (infrastructure only, all open source)

┌─────────────────────────────────────────────────────┐
│            Minimal Viable MLOps Stack               │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Git (code) + DVC (data) → MLflow (tracking)        │
│       │                         │                   │
│       ▼                         ▼                   │
│  GitHub Actions (CI/CD)    Model Registry (MLflow)  │
│       │                         │                   │
│       ▼                         ▼                   │
│  pytest (testing)          FastAPI (serving)        │
│                                 │                   │
│                                 ▼                   │
│                          Evidently (monitoring)     │
│                                                     │
│  Storage: S3 or GCS ($5-50/month)                   │
│  Compute: GitHub Actions (free tier) + cloud GPU    │
└─────────────────────────────────────────────────────┘

Week 1: The Foundation (Days 1-5)

Day 1: Experiment Tracking

The single highest-impact practice. Set up MLflow in 10 minutes:

pip install mlflow
mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./artifacts --port 5000
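
The tracking UI is then available at http://localhost:5000 for browsing runs.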

Add autologging to your training code:

import mlflow
from sklearn.ensemble import RandomForestClassifier
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("my-first-experiment")
mlflow.autolog()
 
# Your existing training code works unchanged; autolog captures
# params, metrics, and the model (X_train/y_train from your data code)
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

Days 2-3: Data Versioning

pip install "dvc[s3]"
dvc init
dvc remote add -d storage s3://my-ml-data/dvc
 
# Track your training data
dvc add data/training.parquet
git add data/training.parquet.dvc .gitignore
git commit -m "Track training data v1"
dvc push
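
To reproduce an old experiment, check out the corresponding commit and let DVC restore the matching data, or read a pinned data version directly from Python via dvc.api (the git tag "v1" below is hypothetical; any commit, branch, or tag works):

# load_version.py: read a specific DVC-tracked data version
import io
import dvc.api
import pandas as pd

raw = dvc.api.read("data/training.parquet", rev="v1", mode="rb")
df = pd.read_parquet(io.BytesIO(raw))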

Days 4-5: Basic CI/CD

# .github/workflows/ml.yml
name: ML Pipeline
on:
  push:
    branches: [main]
    paths: ['src/**', 'configs/**']
 
jobs:
  test-and-train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11', cache: 'pip' }
      - run: pip install -r requirements.txt
      - run: pytest tests/ -v
      - run: python src/train.py --config configs/production.yaml
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
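
The workflow runs pytest tests/ before training, so at least one meaningful test should exist. A minimal sketch, assuming the shared feature module (src/features.py) shown later in the feature-store section:

# tests/test_features.py: sanity checks for the shared feature code
from src.features import compute_features

def test_feature_names_and_bounds():
    raw = {"days_since_last_purchase": 10, "purchases_last_30d": 5, "avg_order_value": 50.0}
    feats = compute_features(raw)
    assert set(feats) == {"recency_score", "frequency_score", "monetary_score"}
    # All three features are designed to land in [0, 1]
    assert all(0.0 <= v <= 1.0 for v in feats.values())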

Result after Week 1: every experiment is tracked, data is versioned, and training runs automatically on every push. Total cost: $0.

Month 1: The Road to Production (Weeks 2-4)

Add Model Serving

# serve.py: a simple FastAPI server (loads the Production model once at startup)
from fastapi import FastAPI
import mlflow.pyfunc
import pandas as pd
 
app = FastAPI()
model = mlflow.pyfunc.load_model("models:/my-model/Production")
 
@app.post("/predict")
async def predict(features: dict):
    df = pd.DataFrame([features])
    return {"prediction": float(model.predict(df)[0])}

Deploy with a single command:

# Option A: Railway/Render (simplest)
# Push code → auto-deploys, free tier available
 
# Option B: Docker
docker build -t model-api .
docker run -p 8080:8080 model-api
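
Option B assumes a Dockerfile at the repo root. A minimal sketch, assuming uvicorn is listed in requirements.txt:

# Dockerfile: minimal image for the FastAPI server above
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8080
CMD ["uvicorn", "serve:app", "--host", "0.0.0.0", "--port", "8080"]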

Add Basic Monitoring

# Weekly monitoring script (run via cron/GitHub Actions)
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
import pandas as pd
 
reference = pd.read_parquet("data/training.parquet")
current = pd.read_parquet("data/production_last_week.parquet")
 
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("reports/drift_report.html")
 
# Check if retraining needed
results = report.as_dict()
if results["metrics"][0]["result"]["dataset_drift"]:
    print("DRIFT DETECTED, consider retraining")
    # Send Slack notification
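
The Slack notification can be a single webhook call. A minimal sketch, assuming a SLACK_WEBHOOK_URL environment variable pointing at an incoming-webhook URL:

# notify.py: post a drift alert to Slack via an incoming webhook
# (SLACK_WEBHOOK_URL is an assumed environment variable)
import os
import requests

def notify_slack(text: str) -> None:
    webhook = os.environ.get("SLACK_WEBHOOK_URL")
    if not webhook:
        print(text)  # fall back to stdout when no webhook is configured
        return
    requests.post(webhook, json={"text": text}, timeout=5).raise_for_status()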

Add Model Registry Promotion

# promote.py: a simple model promotion script
import mlflow
from mlflow.tracking import MlflowClient
 
client = MlflowClient()
 
def promote_best_model(experiment_name: str, metric: str = "f1", min_threshold: float = 0.85):
    experiment = client.get_experiment_by_name(experiment_name)
    runs = client.search_runs([experiment.experiment_id], order_by=[f"metrics.{metric} DESC"], max_results=1)
 
    if not runs:
        print("No runs found")
        return
 
    best_run = runs[0]
    best_metric = best_run.data.metrics.get(metric, 0)
 
    if best_metric < min_threshold:
        print(f"Best {metric}={best_metric:.3f} below threshold {min_threshold}")
        return
 
    model_uri = f"runs:/{best_run.info.run_id}/model"
    mv = mlflow.register_model(model_uri, "my-model")
    client.transition_model_version_stage("my-model", mv.version, "Production", archive_existing_versions=True)
    print(f"Promoted model v{mv.version} ({metric}={best_metric:.3f}) to Production")

Result after Month 1: the model is served via an API, basic monitoring is in place, and promotion is automated. Total cost: ~$50/month (hosting).

Month 3: Scaling Up

When you outgrow the basics, add incrementally:

When to Add a Feature Store

You need one when: multiple models share features, or training and serving compute features differently (training-serving skew). You don't need one when: a single model computes its features from raw input.

# Simple alternative: Feature computation module (shared code)
# src/features.py: imported by both training AND serving
def compute_features(raw_data: dict) -> dict:
    return {
        "recency_score": 1.0 / (1.0 + raw_data["days_since_last_purchase"] / 30),
        "frequency_score": min(raw_data["purchases_last_30d"] / 10, 1.0),
        "monetary_score": min(raw_data["avg_order_value"] / 100, 1.0),
    }
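
Because training and serving import the same module, the feature definitions cannot drift apart. For example, the serving handler from earlier could apply it like this (reusing the app, model, and pd already defined in serve.py):

# In serve.py: run the shared feature code before predicting
from src.features import compute_features

@app.post("/predict")
async def predict(raw: dict):
    df = pd.DataFrame([compute_features(raw)])
    return {"prediction": float(model.predict(df)[0])}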

When to Add Kubeflow/Airflow

You need one when: multiple pipeline stages with different dependencies, GPU scheduling, or complex DAGs. You don't need one when: a single training script handles everything end to end.

# Simple alternative: Makefile pipeline
# Makefile
.PHONY: pipeline
 
pipeline: validate features train evaluate promote
 
validate:
	python src/validate_data.py
 
features:
	python src/compute_features.py
 
train:
	python src/train.py --config configs/production.yaml
 
evaluate:
	python src/evaluate.py
 
promote:
	python src/promote.py

When to Add Auto-Retraining

You need it when: model performance degrades month over month or the data changes frequently. You don't need it when: the model is retrained quarterly and performance is stable.

# GitHub Actions scheduled retraining
name: Weekly Retrain
on:
  schedule:
    - cron: '0 3 * * 0'  # Sunday 3 AM
jobs:
  retrain:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11', cache: 'pip' }
      - run: pip install -r requirements.txt
      - run: make pipeline
      - name: Notify
        if: failure()
        run: curl -X POST "$SLACK_WEBHOOK" -d '{"text":"Retraining failed!"}'
        env:
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

Cost Comparison: Small Team vs. Enterprise

| Component | Small Team | Enterprise |
|-----------|------------|------------|
| Experiment tracking | MLflow (self-hosted): $0 | W&B: $50/user/month |
| Orchestration | GitHub Actions: $0 | Kubeflow + K8s: $500+/month |
| Feature store | Shared Python module: $0 | Tecton: $1,000+/month |
| Serving | Railway/Render: $7-25/month | KServe + K8s: $200+/month |
| Monitoring | Evidently + cron: $0 | Arize: $500+/month |
| Storage | S3: $5-50/month | S3 + data lake: $100+/month |
| Total | $12-75/month | $2,350+/month |

Common Mistakes for Small Teams

  1. Starting with Kubernetes: You probably don't need it. Railway, Render, or a single VM works fine for most serving needs.

  2. Building a "platform": You don't need a platform team until you have 5+ models in production.

  3. Choosing tools based on blog posts: Netflix built its ML platform for Netflix-scale problems. You have different problems. Start with what fits your team.

  4. Skipping experiment tracking: This is the one component you should never skip, even as a solo ML engineer. mlflow.autolog() takes a single line.

  5. A premature feature store: A shared Python module imported by both training and serving is a perfectly valid "feature store" for a small team.

The MLOps Roadmap for Small Teams

Month 1:   MLflow + DVC + GitHub Actions + FastAPI = Production
Month 2:   Add Evidently monitoring + model promotion
Month 3:   Add scheduled retraining + alerting
Month 6:   Evaluate: do we need Kubeflow or a feature store?
Month 12:  Scale only what is actually a bottleneck

Building ML with a small team? DeviDevs helps teams of all sizes build right-sized MLOps infrastructure. No over-engineering, just what you need. Request a free assessment →

