Patterns
Training pipelines, model versioning, feature stores, A/B testing, canary models, shadow mode, drift detection, RAG patterns
The patterns that take you from notebook prototypes to reliable production ML.
Training Pipelines
A training "pipeline" is not just a script that fits a model:
extract data → validate → featurize → split → train → evaluate → register (if good)
Each stage:
- Extract: pull from warehouse / lake, partitioned by date
- Validate: check shape, types, distribution against baseline
- Featurize: deterministic transforms; ideally from a feature store
- Split: train/validation/test; or time-based for production data
- Train: with seeds, configs versioned
- Evaluate: on holdout and on important slices
- Register: with metadata (data version, code version, metrics)
In Kubeflow / Vertex / SageMaker Pipelines / Flyte, each stage is a typed step. Runs are triggered nightly, weekly, or on data update.
The pipeline is the unit of reproducibility. Re-running it on the same inputs should produce the same model (within float precision).
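The stage sequence above can be sketched as plain functions chained together. This is a toy illustration, not any orchestrator's API: the function names, the `min_accuracy` gate, and the in-memory `REGISTRY` are all hypothetical, and featurization is skipped because the toy data has a single raw feature.

```python
import random

REGISTRY = []  # hypothetical stand-in for a model registry


def extract():
    # Toy dataset: label is 1 when x > 0.5. A real pipeline would read a dated partition.
    rng = random.Random(0)
    xs = [rng.random() for _ in range(200)]
    return [(x, int(x > 0.5)) for x in xs]


def validate(rows):
    # Fail fast on shape/type/range problems before spending compute on training.
    assert rows, "empty extract"
    assert all(0.0 <= x <= 1.0 and y in (0, 1) for x, y in rows), "bad row"
    return rows


def split(rows, seed=42):
    # Seeded shuffle so the same inputs always give the same split.
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * 0.8)
    return rows[:cut], rows[cut:]


def train(train_rows):
    # The "model" is a single threshold: the midpoint between the class means.
    pos = [x for x, y in train_rows if y == 1]
    neg = [x for x, y in train_rows if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def evaluate(threshold, test_rows):
    hits = sum(int(x > threshold) == y for x, y in test_rows)
    return hits / len(test_rows)


def run_pipeline(min_accuracy=0.9):
    rows = validate(extract())
    train_rows, test_rows = split(rows)
    model = train(train_rows)
    accuracy = evaluate(model, test_rows)
    if accuracy >= min_accuracy:  # register only if good
        REGISTRY.append({"model": model, "accuracy": accuracy})
    return model, accuracy
```

Because every source of randomness is seeded, calling `run_pipeline()` twice returns the same threshold and the same accuracy, which is the reproducibility property described above.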
Model Versioning
Three things to version together:
Code (Git SHA) + Data (snapshot or version) + Configs (YAML) → Model artifact
Tools track this:
- MLflow records params, metrics, model file, and an artifact hash
- DVC versions large data files (pointer in Git, file in S3)
- lakeFS / Iceberg version data lake state, with time travel back to earlier snapshots
- W&B Artifacts combines code + data + model
The principle: given a model in production, you can recover everything that made it. If you can't, debugging future regressions is guesswork.
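One way to make "recover everything that made it" concrete is to derive a single identifier from the three inputs and store it with the model. The sketch below is an assumption-laden illustration, not how MLflow or DVC do it internally; `lineage_id`, the commit SHA, and the snapshot tag are hypothetical.

```python
import hashlib
import json


def lineage_id(git_sha, data_version, config):
    # Canonical JSON (sorted keys) so the same inputs always hash the same way.
    payload = json.dumps(
        {"code": git_sha, "data": data_version, "config": config},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


record = {
    "git_sha": "3f2a9c1",                 # hypothetical commit
    "data_version": "orders@2024-05-01",  # hypothetical snapshot tag
    "config": {"lr": 0.01, "max_depth": 6},
}
record["lineage_id"] = lineage_id(
    record["git_sha"], record["data_version"], record["config"]
)
```

Two models with the same `lineage_id` came from the same code, data, and config; any change to one of the three produces a different id.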
Feature Stores
A feature store solves three problems:
- Training/serving skew: featurized one way at training, another at inference → silent prediction errors. Feature store gives one transform, used in both.
- Feature reuse: 5 teams compute "average order value last 30 days" 5 different ways. Centralize.
- Point-in-time correctness: when training a model that predicts churn as of date X, use only feature values that were available as of X. This is hard to get right by hand; feature stores handle it with timestamp-aware joins.
# Feast: define a feature view
from feast import BigQuerySource, Entity, FeatureStore, FeatureView, Field
from feast.types import Float32

customer = Entity(name="customer_id")
avg_order_value = FeatureView(
    name="avg_order_value_30d",
    entities=[customer],
    schema=[Field(name="aov_30d", dtype=Float32)],
    source=BigQuerySource(table="analytics.customer_features_30d"),
)

# Assumes a Feast repo (feature_store.yaml) in the current directory
store = FeatureStore(repo_path=".")

# Use at training time (point-in-time)
# labels_df_with_event_timestamp: DataFrame with customer_id and event_timestamp columns
training_df = store.get_historical_features(
    entity_df=labels_df_with_event_timestamp,
    features=["avg_order_value_30d:aov_30d"],
).to_df()

# Use at serving time (online lookup)
features = store.get_online_features(
    features=["avg_order_value_30d:aov_30d"],
    entity_rows=[{"customer_id": "abc123"}],
).to_dict()

Same definition, both contexts. Skew gone.
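To see what point-in-time correctness means mechanically, here is a minimal stdlib sketch of an "as of" lookup. It is not Feast's implementation; `feature_history` and `feature_as_of` are hypothetical names, and timestamps are ISO date strings so that string comparison matches chronological order.

```python
from bisect import bisect_right

# Hypothetical feature history for one customer: (timestamp, aov_30d), sorted by time.
feature_history = [
    ("2024-01-01", 40.0),
    ("2024-02-01", 55.0),
    ("2024-03-01", 70.0),
]


def feature_as_of(history, event_ts):
    """Return the latest feature value with timestamp <= event_ts, else None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_ts)
    return history[idx - 1][1] if idx > 0 else None
```

A label dated 2024-02-15 gets the value written on 2024-02-01, never the later March value, so the training row only sees what was known at the time.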
Deployment Patterns
You rarely just "deploy the new model." Common rollout patterns:
| Pattern | Use |
|---|---|
| Shadow mode | New model runs alongside; predictions logged but not used |
| Canary | 1% of traffic routed to new model; metrics compared |
| A/B test | 50/50 split; compare business metrics |
| Multi-armed bandit | Adaptive traffic shifting based on observed reward |
| Blue/green | Old fully replaced by new, instantly |
| Cohort-based | Different models per customer segment |
Critical: monitor the new model's predictions on real traffic before flipping. A model that did great in offline eval can still flunk online.
Tools: Seldon Core, KServe, BentoML have built-in canary; SageMaker Endpoints have variants; cloud LBs can split traffic with headers/cookies.
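If you split traffic yourself rather than relying on one of these tools, a common approach is deterministic hash bucketing, so the same user always lands on the same model. The sketch below is a hypothetical helper, not any serving framework's API; `route` and `canary_percent` are illustrative names.

```python
import hashlib


def route(request_id, canary_percent=1.0):
    """Return "canary" for roughly canary_percent of ids, else "stable"."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    # canary_percent=1.0 sends buckets 0..99 (about 1% of ids) to the canary.
    return "canary" if bucket < canary_percent * 100 else "stable"
```

Hashing a stable id (user or session) rather than drawing a random number per request keeps each user's experience consistent and makes the canary cohort reproducible for later analysis.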
Drift Monitoring
Models degrade. Causes:
| Drift | Description |
|---|---|
| Feature drift | Input distribution changes (new users, new product mix) |
| Concept drift | Relationship between features and target changes |
| Label drift | The target distribution changes |
| Performance drift | Accuracy/F1 falls over time |
Monitor:
- Feature distributions: KL divergence, PSI (Population Stability Index) per feature, per day
- Prediction distribution: are we predicting "spam" 50% of the time now vs. 5% historically?
- Latency: a model running 2× slower may point to infra (cache misses) or a change in input shape
- Online accuracy: if you have ground-truth labels (clicks, returns, conversions), track accuracy weekly
Tools: Arize, WhyLabs, Evidently (OSS), Fiddler, NannyML.
The trigger: drift detected → alert → re-evaluate → potentially retrain. Automating "retrain when drift > threshold" is doable, but a human usually stays in the loop for the first several rounds.
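For reference, PSI compares binned proportions of a baseline sample $e_i$ and a current sample $a_i$ as $\sum_i (a_i - e_i)\ln(a_i / e_i)$. The following is a minimal stdlib sketch with equal-width bins taken from the baseline range; the bin count and the `eps` floor for empty bins are assumptions, and monitoring tools may bin differently.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline sample and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp values outside the baseline range into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A widely quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but thresholds are worth calibrating per feature on your own history.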
RAG Patterns
For LLM features, the dominant pattern is RAG (retrieval-augmented generation). Variations:
Basic RAG
query → embed → vector search → top-K chunks → LLM(prompt + chunks) → answer
The default. Works well for many use cases.
Re-ranking RAG
query → embed → vector search → top-50 chunks
      → re-ranker (cross-encoder, Cohere Rerank) → top-5 chunks
      → LLM → answer
Bi-encoder embedding is fast but coarse; cross-encoder re-ranking is slow but precise. Two-stage retrieval gets the best of both.
Hybrid RAG (vector + keyword)
query →┬→ vector search → V results
       └→ keyword search (BM25) → K results
                    ↓
       reciprocal rank fusion → combined
                    ↓
                   LLM
Vector search finds semantically similar text; keyword search finds exact terms (names, codes, jargon). Hybrid usually outperforms either alone on real-world queries.
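Reciprocal rank fusion itself is only a few lines: each document scores the sum of `1 / (k + rank)` over every list it appears in. The sketch below uses `k=60`, a commonly used default; the document ids and the two input rankings are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids; score = sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector-search order
keyword_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical BM25 order
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Because it only uses ranks, the fusion needs no score normalization between the vector and keyword retrievers, which is a large part of its appeal.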
Query rewriting / decomposition
user query → LLM (rewrite for retrieval) → embed → search → LLM (answer)
Or multi-hop: question split into sub-questions, each retrieved separately, results combined.
Often the best lift: make the query better, not the retrieval better.
Multi-modal RAG
Documents contain tables, images, and code blocks. Use specialized embedders plus LLMs that handle the modality (GPT-4 Vision, Claude with images).
Caching
query → check semantic cache → if hit, return cached → else LLM call → cache result
LLM calls are slow and expensive. Cache by semantic similarity (not exact text match) to reuse answers. Especially for common questions.
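The cache flow above can be sketched with a similarity threshold. Everything here is illustrative: `embed` is a toy bag-of-words stand-in for a real embedding model, the 0.8 threshold is arbitrary, and `SemanticCache` is a hypothetical class, not a library API.

```python
import math
from collections import Counter


def embed(text):
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer)

    def get(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query, answer):
        self.entries.append((embed(query), answer))


def answer(query, cache, llm):
    # Check the cache first; only call the (slow, expensive) LLM on a miss.
    cached = cache.get(query)
    if cached is not None:
        return cached
    result = llm(query)
    cache.put(query, result)
    return result
```

The threshold is the main tuning knob: too low and users get answers to a different question, too high and the cache rarely hits. A production version would also need expiry and a proper vector index instead of a linear scan.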
Tools: LangChain and LlamaIndex support all these patterns. Custom is fine — orchestration is sometimes overkill.
Evaluating LLM Outputs
Traditional ML: held-out test set + metrics. LLM outputs are free text, so how do you score them?
| Approach | What |
|---|---|
| Reference-based | Compare to a golden answer (ROUGE, BLEU, BERTScore) — works for translation/summarization |
| LLM-as-judge | A bigger LLM evaluates: "Was this answer relevant? Faithful? Complete?" |
| Eval rubrics | Define criteria; check each |
| Behavioral tests | Specific prompts that must give specific outputs |
| Production feedback | Real user ratings (👍/👎); A/B vs old version |
Tools: Promptfoo, Confident AI / DeepEval, Arize Phoenix, LangSmith, Langfuse.
The discipline: every change to the prompt, model, or retrieval needs to be evaluated against a known eval set before shipping. "It feels better" isn't enough.
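A behavioral eval set with a ship gate can start very small. The sketch below assumes a `generate` callable that maps a prompt to text; the two cases, the `must_contain` check, and `ship_gate` are hypothetical and far simpler than what the tools above provide.

```python
EVAL_SET = [  # hypothetical behavioral cases
    {"prompt": "What is 2 + 2?", "must_contain": "4"},
    {"prompt": "Capital of France?", "must_contain": "Paris"},
]


def run_evals(generate, eval_set):
    """Return (pass_rate, failures) for a callable prompt -> text."""
    failures = []
    for case in eval_set:
        output = generate(case["prompt"])
        if case["must_contain"].lower() not in output.lower():
            failures.append({"prompt": case["prompt"], "output": output})
    return 1 - len(failures) / len(eval_set), failures


def ship_gate(candidate_rate, baseline_rate, tolerance=0.0):
    # Block the release if the candidate regresses against the current baseline.
    return candidate_rate >= baseline_rate - tolerance
```

Run it against both the current and the candidate prompt or model, and wire `ship_gate` into CI so a regression fails the build rather than being discovered by users.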
Inference Optimization
For LLM serving, throughput matters:
| Technique | Effect |
|---|---|
| Continuous batching | Add and retire requests at each decoding step so the GPU batch stays full |
| Paged attention | Store the KV cache in fixed-size blocks (pages), cutting fragmentation so larger batches fit |
| Quantization (FP16, INT8, INT4) | Smaller weights, more throughput |
| Speculative decoding | A small model "drafts" tokens, big model verifies |
| KV-cache reuse | Reuse prefix computations across requests |
| Model distillation | Train a smaller model to mimic a bigger one |
| Pruning | Remove unimportant weights |
A common optimization journey: GPT-4 API → fine-tuned smaller model on vLLM → INT8 quantized model → custom-distilled small model. Each step can cut cost by roughly 2-10×, with quality trade-offs to measure at every stage.
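To make the quantization row concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization on a plain list of floats. Real runtimes typically use per-channel or per-group scales and calibration data; the function names and sample weights are illustrative.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> ints in [-127, 127] plus a scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q, scale):
    # Recover approximate floats; the rounding error per weight is at most scale / 2.
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each weight now fits in one byte instead of two or four, at the price of a bounded rounding error. Small weights such as `0.003` round to zero here, which is one source of the quality trade-off mentioned above.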
Fine-Tuning vs. RAG vs. Prompt
Three practical ways to specialize an LLM for your task, plus pretraining for completeness:
| Approach | When |
|---|---|
| Prompting | Quick iteration; small data; works most of the time |
| RAG | The model needs to know your facts; data changes; many sources |
| Fine-tuning | Need consistent format, style, or task; data is stable; willing to invest |
| Pretraining | Reserved for AI labs or massive enterprises |
Order of attempts:
- Try prompting first — cheap, fast, often enough
- Add RAG when prompting alone doesn't have enough context
- Fine-tune when format/style/cost matters and prompting plateaus
- Combine: fine-tuned model + RAG is common at scale
Fine-tuning a 7B model on a few thousand examples can cost as little as $10-100 and take a few hours on a single GPU. Try it before assuming the giant frontier model is the answer.
Anti-Patterns
No baseline metric. "Our model is great" — compared to what? Always track baseline (last week's model, last month's, naive heuristic).
No evaluation set. Shipping LLM features by vibes. Build an eval set early; grow it from real failures.
Retraining without reason. Automatic weekly retraining even when no drift has been detected. Wastes money; can degrade the model if upstream data was bad that week.
Pipeline that can't reproduce. "We don't know exactly which version this is." Sloppy MLOps; future debugging is impossible.
Notebook in production. The model lives in a Jupyter notebook the data scientist runs manually. Once a feature is real, it needs a pipeline.
GPU sprawl. Each team spins up its own GPU node and runs it at 5% utilization. Use a centralized GPU pool with a scheduler (Ray, Slurm, Kubernetes with the NVIDIA GPU Operator).
Prompt injection unguarded. User input goes directly into the LLM prompt; malicious user crafts a payload that exfiltrates context. Treat LLM inputs as untrusted; sandbox tool use.
Forgotten cost. LLM bill explodes from a recursion bug. Set per-app budgets and rate limits at the gateway.
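For the "forgotten cost" case, the guard can be as simple as a counter that refuses calls once a cap is reached. This is a sketch under stated assumptions: `BudgetGuard` is a hypothetical class, the per-token price is passed in by the caller, and a real gateway would persist spend and reset it per billing period.

```python
class BudgetExceeded(Exception):
    pass


class BudgetGuard:
    """Track estimated spend per app and refuse calls once the cap would be exceeded."""

    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, tokens, usd_per_1k_tokens):
        cost = tokens / 1000 * usd_per_1k_tokens
        if self.spent_usd + cost > self.limit_usd:
            raise BudgetExceeded(f"would exceed ${self.limit_usd:.2f} budget")
        self.spent_usd += cost
        return cost
```

Calling `charge` before each LLM request turns a runaway recursion into a fast, visible exception instead of a surprise invoice.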
What's Next
- Best Practices — reproducibility, evaluation, monitoring, GPU cost, LLM safety, pitfalls