Feature Store Architecture: Design Patterns for ML Feature Engineering
Feature stores solve one of the most common problems in production ML: training-serving skew. When features are computed differently during training (batch, in notebooks) and serving (real-time, in production), model performance silently degrades. A feature store ensures consistency by providing a single source of truth for feature computation and storage.
The Training-Serving Skew Problem
Consider a churn prediction model that uses "average order value over last 30 days":
# During training (data scientist in notebook)
df["avg_order_value_30d"] = df.groupby("customer_id")["order_value"].transform(
    lambda x: x.rolling("30D").mean()
)
# During serving (engineer in production)
# SELECT AVG(order_value) FROM orders
# WHERE customer_id = ? AND created_at > NOW() - INTERVAL '30 days'

These two implementations will produce different results due to:
- Different handling of edge cases (nulls, first 30 days)
- Different timestamp semantics (end-of-day vs. point-in-time)
- Different data sources (data lake snapshot vs. live database)
A feature store eliminates this by computing features once and serving them consistently.
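To make the skew concrete, here is a minimal sketch with toy data (illustrative values, not from any real system) where the two code paths above return different values for the same customer:

import pandas as pd

# Toy orders for one customer (assumed values, for illustration only)
orders = pd.DataFrame({
    "customer_id": [1, 1, 1],
    "order_value": [10.0, 20.0, 90.0],
    "created_at": pd.to_datetime(["2026-01-01", "2026-01-20", "2026-03-01"]),
})

# Notebook path: trailing 30-day rolling mean, which includes the current order
rolled = (
    orders.set_index("created_at")
    .groupby("customer_id")["order_value"]
    .transform(lambda s: s.rolling("30D").mean())
)
print(rolled.tolist())  # [10.0, 15.0, 90.0]

# SQL-style path evaluated at serving time on 2026-03-01: the order being scored
# is usually not in the orders table yet, so the 30-day window is empty -> NaN
now = pd.Timestamp("2026-03-01")
window = orders[
    (orders["created_at"] > now - pd.Timedelta(days=30)) & (orders["created_at"] < now)
]
print(window["order_value"].mean())  # nan, while training saw 90.0 for the same row

Neither value is wrong in isolation; the problem is that the model is trained on one definition and scored on another.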
Feature Store Architecture
A feature store is built around a few core components: declarative feature definitions, an offline store, and an online store, which together feed training pipelines and serving endpoints:
Feature Definitions (Code)
┌──────────────────────────────────────────────────────┐
│ feature_views.py — Declarative feature definitions   │
│ entities.py — Entity definitions (customer, product) │
│ sources.py — Data source connections                 │
└──────────────────────────────────────────────────────┘
                            │
               ┌────────────┴────────────┐
               ▼                         ▼
     ┌───────────────────┐     ┌───────────────────┐
     │   Offline Store   │     │   Online Store    │
     │  (Training Data)  │     │  (Serving Data)   │
     │                   │     │                   │
     │  - BigQuery/S3    │     │  - Redis/DynamoDB │
     │  - Full history   │     │  - Latest values  │
     │  - Batch reads    │     │  - Low latency    │
     │  - Point-in-time  │     │  - < 10ms reads   │
     └───────────────────┘     └───────────────────┘
               │                         │
               ▼                         ▼
     ┌───────────────────┐     ┌───────────────────┐
     │ Training Pipeline │     │ Serving Endpoint  │
     │ (Kubeflow/Airflow)│     │  (FastAPI/gRPC)   │
     └───────────────────┘     └───────────────────┘
Offline Store
- Stores the full history of feature values
- Used for training data generation with point-in-time correctness
- Backed by data warehouses (BigQuery, Snowflake) or object storage (S3/Parquet)
- Batch reads, high throughput, higher latency acceptable
Online Store
- Stores the latest feature values for each entity
- Used for real-time serving (model inference)
- Backed by low-latency key-value stores (Redis, DynamoDB)
- Single-digit millisecond reads required
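To picture what the online store holds, here is a hypothetical sketch of a Redis-backed layout (the key scheme and values are illustrative, not any specific tool's storage format):

import redis

r = redis.Redis()  # assumes a local Redis instance

# One hash per (feature view, entity key), holding only the latest values
r.hset("customer_behavior:1001", mapping={
    "purchases_last_30d": 3,
    "avg_order_value_30d": 167.88,
    "days_since_last_purchase": 5,
})

# A single HGETALL round trip serves every feature for the entity
print(r.hgetall("customer_behavior:1001"))

The entire read path is one key lookup, which is what keeps serving latency in the single-digit-millisecond range.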
Implementing with Feast
Feast is the most widely used open-source feature store. Here's an end-to-end example:
Define Entities and Sources
# features/entities.py
from feast import Entity, ValueType

customer = Entity(
    name="customer_id",
    value_type=ValueType.INT64,
    description="Unique customer identifier",
)

product = Entity(
    name="product_id",
    value_type=ValueType.STRING,
    description="Product SKU",
)

# features/sources.py
from feast import FileSource, BigQuerySource
from datetime import timedelta

# For development: file-based source
customer_source_file = FileSource(
    path="data/customer_features.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created_timestamp",
)

# For production: BigQuery source
customer_source_bq = BigQuerySource(
    table="ml_features.customer_daily_features",
    timestamp_field="event_timestamp",
    created_timestamp_column="created_timestamp",
)

# Real-time source for streaming features
transaction_source = FileSource(
    path="data/transaction_features.parquet",
    timestamp_field="event_timestamp",
)

Define Feature Views
# features/feature_views.py
# Note: this uses the older Feast API (Feature objects and features=);
# recent Feast releases express the same definitions with schema=[Field(...)].
from feast import FeatureView, Feature, ValueType
from datetime import timedelta

from features.entities import customer, product
from features.sources import customer_source_bq, transaction_source

customer_profile_features = FeatureView(
    name="customer_profile",
    entities=[customer],
    ttl=timedelta(days=1),
    features=[
        Feature(name="account_age_days", dtype=ValueType.INT64),
        Feature(name="total_orders", dtype=ValueType.INT64),
        Feature(name="lifetime_value", dtype=ValueType.DOUBLE),
        Feature(name="preferred_category", dtype=ValueType.STRING),
        Feature(name="email_verified", dtype=ValueType.BOOL),
    ],
    online=True,
    source=customer_source_bq,
    tags={"team": "ml-platform", "domain": "customer"},
)

customer_behavioral_features = FeatureView(
    name="customer_behavior",
    entities=[customer],
    ttl=timedelta(hours=6),  # More frequently updated
    features=[
        Feature(name="purchases_last_7d", dtype=ValueType.INT64),
        Feature(name="purchases_last_30d", dtype=ValueType.INT64),
        Feature(name="avg_order_value_30d", dtype=ValueType.DOUBLE),
        Feature(name="days_since_last_purchase", dtype=ValueType.INT64),
        Feature(name="cart_abandonment_rate_30d", dtype=ValueType.DOUBLE),
        Feature(name="support_tickets_last_30d", dtype=ValueType.INT64),
        Feature(name="page_views_last_7d", dtype=ValueType.INT64),
    ],
    online=True,
    source=customer_source_bq,
    tags={"team": "ml-platform", "domain": "customer"},
)

Apply and Materialize
# Apply feature definitions
feast apply
# Materialize features to the online store
feast materialize 2026-01-01T00:00:00 2026-02-05T00:00:00
# Incremental materialization (run daily)
feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")

Training: Point-in-Time Feature Retrieval
from feast import FeatureStore
import pandas as pd
store = FeatureStore(repo_path="features/")
# Training entity DataFrame with timestamps
entity_df = pd.DataFrame({
    "customer_id": [1001, 1002, 1003, 1001, 1002],
    "event_timestamp": pd.to_datetime([
        "2026-01-15", "2026-01-15", "2026-01-15",
        "2026-02-01", "2026-02-01",
    ]),
    "label": [1, 0, 0, 1, 1],  # churn labels
})

# Get historical features with point-in-time correctness
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "customer_profile:account_age_days",
        "customer_profile:total_orders",
        "customer_profile:lifetime_value",
        "customer_behavior:purchases_last_30d",
        "customer_behavior:avg_order_value_30d",
        "customer_behavior:days_since_last_purchase",
        "customer_behavior:cart_abandonment_rate_30d",
    ],
).to_df()

print(training_df.head())
# customer_id  event_timestamp  label  account_age_days  total_orders  ...
# 1001         2026-01-15       1      365               23            ...
# 1002         2026-01-15       0      180               8             ...

Point-in-time correctness means: for customer 1001 on 2026-01-15, the feature store returns the feature values as they were on that date, not the latest values. This prevents data leakage, i.e. training on information that would not have been available at prediction time.
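Mechanically, this behaves like a backward as-of join between the label timestamps and the feature history. A minimal pandas sketch of the semantics (toy values, not Feast internals):

import pandas as pd

# Feature history for one customer: two snapshots of total_orders (assumed data)
feature_history = pd.DataFrame({
    "customer_id": [1001, 1001],
    "event_timestamp": pd.to_datetime(["2026-01-10", "2026-01-28"]),
    "total_orders": [23, 27],
})
entity_df = pd.DataFrame({
    "customer_id": [1001],
    "event_timestamp": pd.to_datetime(["2026-01-15"]),
})

# Backward as-of join: pick the latest feature row at or before each label timestamp
joined = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    feature_history.sort_values("event_timestamp"),
    on="event_timestamp",
    by="customer_id",
    direction="backward",
)
print(joined["total_orders"].iloc[0])  # 23, not the later value 27

The row timestamped 2026-01-28 exists in the history but is ignored, because it lies after the label's timestamp.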
Serving: Online Feature Retrieval
from feast import FeatureStore
store = FeatureStore(repo_path="features/")
# Get latest features for real-time prediction
online_features = store.get_online_features(
    features=[
        "customer_profile:account_age_days",
        "customer_profile:total_orders",
        "customer_profile:lifetime_value",
        "customer_behavior:purchases_last_30d",
        "customer_behavior:avg_order_value_30d",
        "customer_behavior:days_since_last_purchase",
    ],
    entity_rows=[{"customer_id": 1001}],
).to_dict()

print(online_features)
# {
#     "customer_id": [1001],
#     "account_age_days": [380],
#     "total_orders": [27],
#     "lifetime_value": [4523.50],
#     "purchases_last_30d": [3],
#     "avg_order_value_30d": [167.88],
#     "days_since_last_purchase": [5],
# }

Integration with Model Serving
from fastapi import FastAPI
from feast import FeatureStore
import mlflow.pyfunc
app = FastAPI()
feature_store = FeatureStore(repo_path="features/")
model = mlflow.pyfunc.load_model("models:/churn-predictor/Production")
FEATURE_LIST = [
    "customer_profile:account_age_days",
    "customer_profile:total_orders",
    "customer_profile:lifetime_value",
    "customer_behavior:purchases_last_30d",
    "customer_behavior:avg_order_value_30d",
    "customer_behavior:days_since_last_purchase",
    "customer_behavior:cart_abandonment_rate_30d",
]

@app.post("/predict/churn")
async def predict_churn(customer_id: int):
    # Fetch features from online store
    features = feature_store.get_online_features(
        features=FEATURE_LIST,
        entity_rows=[{"customer_id": customer_id}],
    ).to_df()

    # Run prediction (column order must match what the model saw at training time)
    prediction = model.predict(features.drop(columns=["customer_id"]))

    return {
        "customer_id": customer_id,
        "churn_probability": float(prediction[0]),
        "features_used": features.to_dict(orient="records")[0],
    }

Design Pattern: On-Demand Features
Some features need to be computed at request time (e.g., "time since last page view" changes every second). Use on-demand feature views:
import pandas as pd

from feast import on_demand_feature_view, Field
from feast.types import Float64

from features.feature_views import customer_behavioral_features

@on_demand_feature_view(
    sources=[customer_behavioral_features],
    schema=[
        Field(name="recency_score", dtype=Float64),
        Field(name="activity_score", dtype=Float64),
    ],
)
def customer_scores(inputs: pd.DataFrame) -> pd.DataFrame:
    """Compute derived features on demand."""
    df = pd.DataFrame()

    # Recency score: 0-1, higher = more recent
    df["recency_score"] = 1.0 / (1.0 + inputs["days_since_last_purchase"] / 30.0)

    # Activity score: combines multiple signals
    df["activity_score"] = (
        inputs["purchases_last_7d"] * 0.4
        + inputs["page_views_last_7d"] * 0.01
        + (1 - inputs["cart_abandonment_rate_30d"]) * 0.3
    )
    return df

Design Pattern: Streaming Features
For real-time features that update with every event (e.g., running count of purchases today):
# Kafka → Flink/Spark Streaming → Online Store

# Flink job to compute streaming features
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

# Read from Kafka (broker address and group id below are placeholders)
t_env.execute_sql("""
    CREATE TABLE transactions (
        customer_id BIGINT,
        order_value DOUBLE,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'transactions',
        'properties.bootstrap.servers' = 'kafka:9092',
        'properties.group.id' = 'feature-pipeline',
        'format' = 'json'
    )
""")

# Compute streaming features over a rolling 1-day window per customer.
# The sink table (feature_store_sink) is assumed to be defined elsewhere,
# e.g. a connector table that pushes rows into the online store.
t_env.execute_sql("""
    INSERT INTO feature_store_sink
    SELECT
        customer_id,
        COUNT(*) OVER w AS purchases_today,
        SUM(order_value) OVER w AS spend_today,
        AVG(order_value) OVER w AS avg_order_today
    FROM transactions
    WINDOW w AS (
        PARTITION BY customer_id
        ORDER BY event_time
        RANGE BETWEEN INTERVAL '1' DAY PRECEDING AND CURRENT ROW
    )
""")

Custom Feature Store Implementation
For teams that need more control or have simpler requirements, here's a minimal custom feature store:
from datetime import datetime
from typing import Any

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import redis

class SimpleFeatureStore:
    """Minimal feature store with offline (Parquet) + online (Redis) stores."""

    def __init__(self, offline_path: str, redis_url: str):
        self.offline_path = offline_path
        self.redis = redis.from_url(redis_url)

    def materialize(self, feature_view: str, df: pd.DataFrame) -> None:
        """Write features to both offline and online stores."""
        # Offline: append to a partitioned Parquet dataset
        path = f"{self.offline_path}/{feature_view}"
        pq.write_to_dataset(pa.Table.from_pandas(df), path, partition_cols=["date"])

        # Online: write latest values to Redis
        for _, row in df.iterrows():
            entity_key = f"{feature_view}:{row['entity_id']}"
            features = row.drop(["entity_id", "event_timestamp", "date"]).to_dict()
            self.redis.hset(entity_key, mapping={k: str(v) for k, v in features.items()})
            self.redis.expire(entity_key, 86400 * 7)  # 7-day TTL

    def get_online_features(self, feature_view: str, entity_id: str) -> dict[str, Any]:
        """Fetch latest features from Redis."""
        key = f"{feature_view}:{entity_id}"
        raw = self.redis.hgetall(key)
        return {k.decode(): float(v.decode()) for k, v in raw.items()}

    def get_historical_features(self, feature_view: str, entity_df: pd.DataFrame) -> pd.DataFrame:
        """Point-in-time join from offline store."""
        features_df = pd.read_parquet(f"{self.offline_path}/{feature_view}")

        # Point-in-time join: for each entity+timestamp, get the most recent feature values
        merged = pd.merge_asof(
            entity_df.sort_values("event_timestamp"),
            features_df.sort_values("event_timestamp"),
            on="event_timestamp",
            by="entity_id",
            direction="backward",
        )
        return merged

Feature Store Comparison
| Feature Store | Type        | Online Store    | Offline Store | Streaming    | Best For               |
|---------------|-------------|-----------------|---------------|--------------|------------------------|
| Feast         | Open source | Redis, DynamoDB | BigQuery, S3  | Kafka → push | Self-hosted, any cloud |
| Tecton        | Managed     | DynamoDB        | S3/Snowflake  | Native       | Enterprise, real-time  |
| Hopsworks     | Open source | RonDB           | Hudi          | Flink        | On-prem, all-in-one    |
| Databricks    | Managed     | DynamoDB        | Delta Lake    | Spark        | Databricks users       |
| Vertex AI     | Managed     | Bigtable        | BigQuery      | Dataflow     | GCP users              |
When Do You Need a Feature Store?
You need one if:
- Multiple models share the same features
- Training and serving use different data sources
- You need point-in-time correctness for training
- Feature computation is expensive and should be cached
- You need real-time features with < 50ms latency
You probably don't need one if:
- You have a single model in production
- Features are simple transformations of raw input
- All models are batch (no real-time serving)
- Your team has fewer than 3 ML engineers
Next Steps
- MLflow — Track experiments and manage models alongside your feature store
- Kubeflow Pipelines — Orchestrate feature computation and training pipelines
- MLOps best practices — Integrate feature stores into your ML workflow
- What is MLOps? — Understand where feature stores fit in the MLOps lifecycle
Building a feature store for your ML platform? DeviDevs designs and implements production feature stores with Feast, custom solutions, or managed platforms. Get a free assessment →