Training pipelines, model versioning, feature stores, A/B testing, canary models, shadow mode, drift detection, RAG patterns

Patterns

The patterns that take you from notebook prototypes to reliable production ML.

Training Pipelines

A training pipeline is more than a script that fits a model. It is a chain of stages, each with an output you can check:

extract data → validate → featurize → split → train → evaluate → register (if good)

Each stage:

Extract: pull from warehouse / lake, partitioned by date
Validate: check shape, types, distribution against baseline
Featurize: deterministic transforms; ideally from a feature store
Split: train/validation/test; or time-based for production data
Train: with seeds, configs versioned
Evaluate: on holdout and on important slices
Register: with metadata (data version, code version, metrics)

In Kubeflow / Vertex / SageMaker Pipelines / Flyte, each stage is a typed step. Runs are typically triggered nightly, weekly, or when new data lands.

The pipeline is the unit of reproducibility. Re-running it on the same inputs should produce the same model, up to floating-point nondeterminism (for example from GPU kernels).
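The stages above can be sketched in plain Python. This is a toy illustration, not how any of the orchestrators named above define steps: the synthetic data, the threshold "model", and the `min_accuracy` gate are all assumptions made for the example. The point is that every source of randomness is seeded, so two runs on the same inputs agree.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    seed: int = 42              # versioned alongside the code
    min_accuracy: float = 0.8   # hypothetical registration gate


def extract(n=200, seed=0):
    # Stand-in for reading a fixed data snapshot: a fixed seed gives fixed rows.
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        rows.append({"x": x, "label": int(x > 5.0)})
    return rows


def validate(rows):
    # Fail fast on empty input, wrong types, or unexpected label values.
    if not rows:
        raise ValueError("no rows extracted")
    for row in rows:
        if not isinstance(row["x"], float) or row["label"] not in (0, 1):
            raise ValueError(f"bad row: {row}")
    return rows


def featurize(rows):
    # Deterministic transform: same rows in, same examples out.
    return [(row["x"], row["label"]) for row in rows]


def split(examples, seed):
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def train(train_set):
    # Toy model: a threshold halfway between the two class means.
    pos = [x for x, y in train_set if y == 1]
    neg = [x for x, y in train_set if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def evaluate(threshold, test_set):
    correct = sum(int(x > threshold) == y for x, y in test_set)
    return correct / len(test_set)


def run_pipeline(config):
    rows = validate(extract())
    train_set, test_set = split(featurize(rows), config.seed)
    threshold = train(train_set)
    accuracy = evaluate(threshold, test_set)
    # "Register (if good)": only models that clear the gate are registered.
    registered = accuracy >= config.min_accuracy
    return {"threshold": threshold, "accuracy": accuracy, "registered": registered}
```

Calling `run_pipeline(PipelineConfig())` twice returns identical dictionaries, which is the reproducibility property in miniature.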

Model Versioning

Three things to version together:

Code (Git SHA) + Data (snapshot or version) + Configs (YAML) → Model artifact

Tools track this:

MLflow records params, metrics, model file, and an artifact hash
DVC versions large data files (pointer in Git, file in S3)
LakeFS / Iceberg time-travel versions data lake state
W&B Artifacts combines code + data + model

The principle: given a model in production, you can recover everything that made it. If you can't, debugging future regressions is guesswork.
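One way to make "version these three together" concrete is to derive a single fingerprint from them and store it with the model. The sketch below is an assumption for illustration, not the mechanism MLflow, DVC, or W&B use internally; the function names and record layout are hypothetical.

```python
import hashlib
import json


def lineage_fingerprint(git_sha, data_version, config):
    # Canonical JSON (sorted keys, fixed separators) so that logically equal
    # inputs always hash to the same value.
    payload = json.dumps(
        {"code": git_sha, "data": data_version, "config": config},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def registry_record(git_sha, data_version, config, metrics):
    # Hypothetical record to store next to the model artifact.
    return {
        "fingerprint": lineage_fingerprint(git_sha, data_version, config),
        "code": git_sha,
        "data": data_version,
        "config": config,
        "metrics": metrics,
    }
```

Given a production model's record, the three inputs can be checked out again, and a fingerprint mismatch signals that something other than what was recorded produced the artifact.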

Feature Stores

A feature store solves three problems:

Training/serving skew: data is featurized one way at training time and another way at inference, which causes silent prediction errors. A feature store provides one transform definition, used in both.
Feature reuse: five teams compute "average order value last 30 days" five different ways. Centralize the definition.
Point-in-time correctness: when training a model that predicts churn as of date X, use only feature values that existed as of X. This is hard to get right by hand; feature stores handle the join for you.

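# Feast: define a feature view (parameter names follow recent Feast releases)
from feast import BigQuerySource, Entity, FeatureStore, FeatureView, Field
from feast.types import Float32

customer = Entity(name="customer_id", join_keys=["customer_id"])

avg_order_value = FeatureView(
    name="avg_order_value_30d",
    entities=[customer],
    schema=[Field(name="aov_30d", dtype=Float32)],
    source=BigQuerySource(
        table="analytics.customer_features_30d",
        # Assumed column name; a timestamp column is what makes point-in-time joins possible
        timestamp_field="event_timestamp",
    ),
)

# Assumes a feature repo (feature_store.yaml) in the current directory,
# with the definitions above already registered via `feast apply`
store = FeatureStore(repo_path=".")

# Use at training time (point-in-time)
# labels_df_with_event_timestamp: a DataFrame with customer_id and event_timestamp columns
training_df = store.get_historical_features(
    entity_df=labels_df_with_event_timestamp,
    features=["avg_order_value_30d:aov_30d"],
).to_df()

# Use at serving time (online lookup)
features = store.get_online_features(
    features=["avg_order_value_30d:aov_30d"],
    entity_rows=[{"customer_id": "abc123"}],
).to_dict()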

Same definition in both contexts, which removes the main source of skew.
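To see what point-in-time correctness means mechanically, here is a simplified sketch of the lookup for a single customer. It is not Feast's implementation (which also handles TTLs, multiple entities, and joins at scale); `value_as_of` and the sample history are made up for illustration.

```python
from bisect import bisect_right
from datetime import date


def value_as_of(history, as_of):
    """Return the latest feature value recorded at or before `as_of`.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None if no value existed yet, so nothing leaks from the future.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical 30-day average order value for one customer, recomputed monthly.
aov_history = [
    (date(2024, 1, 1), 40.0),
    (date(2024, 2, 1), 55.0),
    (date(2024, 3, 1), 70.0),
]
```

A label dated 15 February 2024 must be paired with the 1 February value, not the March one: `value_as_of(aov_history, date(2024, 2, 15))` returns `55.0`, and a label dated before January returns `None`.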

Deployment Patterns

You rarely just "deploy the new model." Common patterns:

Pattern	Use
Shadow mode	New model runs alongside the live one; its predictions are logged but not served
Canary	A small share of traffic (e.g. 1%) is routed to the new model; metrics are compared
A/B test	Traffic is split (often 50/50); business metrics are compared
Multi-armed bandit	Traffic shifts adaptively based on observed reward
Blue/green	Old model is fully replaced by the new one in a single switch
Cohort-based	Different models per customer segment

Critical: monitor the new model's predictions on real traffic before flipping. A model that did great in offline eval can still flunk online.

Tools: Seldon Core, KServe, BentoML have built-in canary; SageMaker Endpoints have variants; cloud LBs can split traffic with headers/cookies.
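If you do roll your own routing, the canary and shadow rows of the table can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, models are plain callables, and the shadow call runs inline, whereas a real service would usually run it asynchronously so it cannot add latency.

```python
import hashlib

BUCKETS = 10_000


def bucket(request_key: str) -> int:
    # Hash a stable key (e.g. user ID) so the same user always lands in the
    # same bucket, independent of process or host.
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def route(request_key: str, canary_fraction: float) -> str:
    # Canary: send roughly `canary_fraction` of keys to the new model.
    return "canary" if bucket(request_key) < canary_fraction * BUCKETS else "stable"


def serve_with_shadow(features, live_model, shadow_model, shadow_log):
    # Shadow mode: the live prediction is returned; the shadow prediction is
    # only logged. A failing shadow model must never break the live path.
    live_prediction = live_model(features)
    try:
        shadow_log.append((features, live_prediction, shadow_model(features)))
    except Exception as exc:
        shadow_log.append((features, live_prediction, exc))
    return live_prediction
```

Hashing a stable key rather than drawing a random number per request keeps each user on one model for the whole experiment, which makes the metric comparison cleaner.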

Drift Monitoring

Models degrade. The kinds of drift to watch for:

Drift	Description
Feature drift	Input distribution changes (new users, new product mix)
Concept drift	Relationship between features and target changes
Label drift	The target distribution changes
Performance drift	Accuracy/F1 falls over time

Monitor:

Feature distributions: KL divergence or PSI (Population Stability Index), per feature, per day
Prediction distribution: is the model predicting "spam" 50% of the time now, versus 5% historically?
Latency: a model that is suddenly 2x slower may point to infrastructure (cache misses) or to a change in input shape
Online accuracy: if you have ground-truth labels (clicks, returns, conversions), track accuracy weekly
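PSI from the first item is simple enough to compute directly. For bin proportions $e_i$ (baseline) and $a_i$ (current), PSI is $\sum_i (a_i - e_i)\ln(a_i / e_i)$. The sketch below assumes equal-width bins taken from the baseline's range and a small `eps` floor for empty bins; both are choices made for this example, and monitoring tools may bin differently.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the `expected` baseline."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range fall into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Identical samples give a PSI of exactly 0, and every term is non-negative, so the score only grows as the distributions diverge. Commonly quoted rules of thumb treat values above roughly 0.1 as worth a look and above roughly 0.25 as a significant shift, but the alert threshold should be tuned per feature.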

Tools: Arize, WhyLabs, Evidently (OSS), Fiddler, NannyML.

The trigger: drift detected → alert → re-evaluate → potentially retrain. Automating "retrain when drift > threshold" is doable, but a human usually stays in the loop for the first several retrains.

RAG Patterns

For LLM features, the dominant pattern is retrieval-augmented generation (RAG). Variations:

Basic RAG

query → embed → vector search → top-K chunks → LLM(prompt + chunks) → answer

The default. Works well for many use cases.

Re-ranking RAG

query → embed → vector search → top-50 chunks
        → re-ranker (cross-encoder, Cohere Rerank) → top-5 chunks
        → LLM → answer

Bi-encoder embedding is fast but coarse; cross-encoder re-ranking is slow but precise. Two-stage retrieval gets the best of both.

Hybrid RAG (vector + keyword)

query →┬→ vector search → V results
       └→ keyword search (BM25) → K results
              ↓
         reciprocal rank fusion → combined
              ↓
            LLM

Vector finds semantically similar; keyword finds exact terms (names, codes, jargon). Hybrid usually outperforms either alone on real-world queries.
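The fusion step in the diagram is short enough to write out. Reciprocal rank fusion scores each document as the sum of $1/(k + \text{rank})$ over the result lists it appears in. The sketch below assumes each retriever returns an ordered list of document IDs and uses $k = 60$, a commonly used default rather than a value taken from this page.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the rank matters, so raw scores from different retrievers
            # never need to be put on a common scale.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_results = ["a", "b", "c"]    # hypothetical vector-search ranking
keyword_results = ["b", "d", "a"]   # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([vector_results, keyword_results])
```

Here `fused` is `["b", "a", "d", "c"]`: documents found by both retrievers rise to the top, and `b` edges out `a` because its two ranks (2 and 1) beat `a`'s (1 and 3).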

Query rewriting / decomposition

user query → LLM (rewrite for retrieval) → embed → search → LLM (answer)

Or multi-hop: question split into sub-questions, each retrieved separately, results combined.

Often the best lift: make the query better, not the retrieval better.

Multi-modal RAG

Documents contain tables, images, and code blocks. Use specialized embedders and LLMs that handle the modality (GPT-4 Vision, Claude with images).

Caching

query → check semantic cache → if hit, return cached → else LLM call → cache result

LLM calls are slow and expensive. Cache by semantic similarity (not exact text match) to reuse answers. Especially for common questions.
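The cache flow above can be sketched as follows. Everything here is a stand-in: `toy_embed` is a bag-of-words counter used only so the example runs without a model (a real cache would use a sentence-embedding model and a vector index), and the `0.8` threshold is an arbitrary assumption.

```python
import math
from collections import Counter


def toy_embed(text):
    # Stand-in embedding: word counts. Replace with a real embedding model.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer)

    def get(self, query):
        q = toy_embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query, answer):
        self.entries.append((toy_embed(query), answer))


def answer(query, cache, llm):
    # Check the cache first; only call the (slow, expensive) LLM on a miss.
    cached = cache.get(query)
    if cached is not None:
        return cached
    result = llm(query)
    cache.put(query, result)
    return result
```

The threshold is the main design choice: too low and users get answers to a different question, too high and the cache rarely hits. Answers that depend on the user or on fresh data should bypass the cache entirely.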

Tools: LangChain and LlamaIndex support all these patterns. Custom is fine — orchestration is sometimes overkill.

Evaluating LLM Outputs

Traditional ML is evaluated with a held-out test set and metrics. LLM outputs are free text, so how do you score them?

Approach	What
Reference-based	Compare to a golden answer (ROUGE, BLEU, BERTScore) — works for translation/summarization
LLM-as-judge	A bigger LLM evaluates: "Was this answer relevant? Faithful? Complete?"
Eval rubrics	Define criteria; check each
Behavioral tests	Specific prompts that must give specific outputs
Production feedback	Real user ratings (👍/👎); A/B vs old version

Tools: Promptfoo, Confident AI / DeepEval, Arize Phoenix, LangSmith, Langfuse.

The discipline: every change to the prompt, model, or retrieval needs to be evaluated against a known eval set before shipping. "It feels better" isn't enough.
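A gate of this kind can be very small. The sketch below covers only the behavioral-test row of the table and is not tied to any of the tools listed; `run_eval`, `should_ship`, the case format, and the `min_rate` default are all hypothetical.

```python
def run_eval(generate, cases):
    """Run every case through `generate` and return (pass_rate, failed_prompts).

    Each case is a dict with a "prompt" and a "check" callable that returns
    True when the output is acceptable.
    """
    failures = [c["prompt"] for c in cases if not c["check"](generate(c["prompt"]))]
    return 1 - len(failures) / len(cases), failures


def should_ship(candidate_rate, baseline_rate, min_rate=0.9):
    # Ship only if the candidate clears an absolute bar and does not regress
    # against the version currently in production.
    return candidate_rate >= min_rate and candidate_rate >= baseline_rate


# Hypothetical eval set; in practice, grow it from real production failures.
eval_cases = [
    {"prompt": "What is 2 + 2?", "check": lambda out: "4" in out},
    {"prompt": "Reply in JSON.", "check": lambda out: out.strip().startswith("{")},
]
```

Run the same `eval_cases` against the current and the candidate configuration, then pass both pass rates to `should_ship`. The failed prompts returned by `run_eval` tell you what to look at before arguing about whether the change "feels better".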

Inference Optimization

For LLM serving, throughput matters:

Technique	Effect
Continuous batching	Admit and retire requests at every decoding step so the GPU batch stays full
Paged attention	Store the KV cache in fixed-size pages to cut memory fragmentation and allow larger batches
Quantization (FP16, INT8, INT4)	Smaller weights, more throughput
Speculative decoding	A small model "drafts" tokens, big model verifies
KV-cache reuse	Reuse prefix computations across requests
Model distillation	Train a smaller model to mimic a bigger one
Pruning	Remove unimportant weights

A common optimization journey: GPT-4 API → fine-tuned smaller model on vLLM → INT8 quantized model → custom-distilled small model. Each step can cut cost by roughly 2-10×, with quality trade-offs.
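To make the quantization row concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization on a plain list of weights. It illustrates the idea only; production libraries typically quantize per channel or per group and run on tensors, and the function names are made up for this example.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    # Recover approximate floats; the rounding error is the quality trade-off.
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each weight now needs 8 bits instead of 16 or 32, and the reconstruction error per weight is at most half the scale. A single outlier weight inflates the scale for the whole tensor, which is one reason finer-grained schemes exist.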

Fine-Tuning vs. RAG vs. Prompt

Three ways to specialize an LLM for your task:

Approach	When
Prompting	Quick iteration; small data; works most of the time
RAG	The model needs to know your facts; data changes; many sources
Fine-tuning	Need consistent format, style, or task; data is stable; willing to invest
Pretraining	Reserved for AI labs or massive enterprises

Order of attempts:

Try prompting first — cheap, fast, often enough
Add RAG when prompting alone doesn't have enough context
Fine-tune when format/style/cost matters and prompting plateaus
Combine: fine-tuned model + RAG is common at scale

Fine-tuning a 7B model on a few thousand examples can cost as little as $10-100 and a few hours on a single GPU. Try it before assuming the giant frontier model is the answer.

Anti-Patterns

No baseline metric. "Our model is great" — compared to what? Always track baseline (last week's model, last month's, naive heuristic).

No evaluation set. Shipping LLM features by vibes. Build an eval set early; grow it from real failures.

Retraining without reason. Retraining weekly on autopilot even when nothing has drifted. This wastes money and can degrade the model if upstream data was bad that week.

Pipeline that can't reproduce. "We don't know exactly which version this is." Sloppy MLOps; future debugging is impossible.

Notebook in production. The model lives in a Jupyter notebook the data scientist runs manually. Once a feature is real, it needs a pipeline.

GPU sprawl. Each team spins up its own GPU node and runs it at 5% utilization. Use a centralized GPU pool with a scheduler (Ray, Slurm, K8s NVIDIA operator).

Prompt injection unguarded. User input goes directly into the LLM prompt; malicious user crafts a payload that exfiltrates context. Treat LLM inputs as untrusted; sandbox tool use.

Forgotten cost. LLM bill explodes from a recursion bug. Set per-app budgets and rate limits at the gateway.
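A per-app budget check is the kind of guard that stops a recursion bug from running up the bill. The sketch below is an in-memory illustration under stated assumptions: `BudgetGuard` and `BudgetExceeded` are hypothetical names, and a real gateway would persist spend, reset it per billing period, and be safe under concurrency.

```python
class BudgetExceeded(RuntimeError):
    pass


class BudgetGuard:
    def __init__(self, limits_usd):
        # limits_usd: mapping of app name to its spending cap in USD.
        self.limits_usd = dict(limits_usd)
        self.spent_usd = {app: 0.0 for app in self.limits_usd}

    def charge(self, app, cost_usd):
        """Record a call's cost, or raise before the cap would be exceeded."""
        if app not in self.limits_usd:
            raise KeyError(f"no budget configured for {app!r}")
        projected = self.spent_usd[app] + cost_usd
        if projected > self.limits_usd[app]:
            raise BudgetExceeded(f"{app} would exceed ${self.limits_usd[app]:.2f}")
        self.spent_usd[app] = projected
        return self.limits_usd[app] - projected  # remaining budget
```

Call `charge` before forwarding each request to the LLM provider, using an estimated cost; a runaway loop then fails loudly at the cap instead of silently at the end of the month.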

What's Next

Best Practices — reproducibility, evaluation, monitoring, GPU cost, LLM safety, pitfalls
