Best Practices
Reproducibility, evaluation, monitoring, GPU cost, LLM safety, common pitfalls, when to fine-tune vs RAG vs prompt
Best Practices
The operational realities of running ML and AI in production without burning cash or shipping unsafe systems.
Reproducibility
The hardest discipline. Required for: rolling back, debugging, audit, compliance, future improvement.
- Pin everything: Python deps (
requirements.txtwith hashes), CUDA version, OS image - Container the training: same Dockerfile in dev and prod training
- Seed deterministic code paths (
torch.manual_seed,numpy.random.seed) — even if not 100% reproducible due to CUDA non-determinism, it gets close - Version data: data snapshot ID logged with every training run
- Track configs: every hyperparameter, every code SHA, every data version, in MLflow / W&B
A 6-month-old model that has a customer impact bug: with reproducibility, you can rebuild it, reproduce the bug, find the cause. Without, you guess.
Evaluation as a System
The eval pipeline matters as much as the training pipeline:
- Hold out data the model never sees during training/tuning
- Slice by important segments: country, customer tier, device type — overall accuracy is misleading
- Behavioral tests: specific inputs that must produce specific outputs (regression suite)
- Fairness checks: no protected group should be systematically worse
- Robustness: how does the model handle adversarial / noisy input?
In CI: model build fails if metrics on critical slices regress. No "we'll check later" — automate.
For LLMs:
- Build an eval set from real failures (production examples with desired/actual outputs)
- LLM-as-judge with rubrics; calibrate the judge against human ratings
- Track latency + cost as first-class metrics alongside quality
Monitoring in Production
Beyond classical metrics:
| Metric | Threshold to alert |
|---|---|
| Prediction latency P99 | > 2× baseline |
| Error rate | > 1% sustained |
| Feature distribution drift (per feature) | PSI > 0.2 |
| Prediction distribution drift | KL divergence > 0.1 |
| Online accuracy (if labels available) | Below baseline - delta |
| Token usage (LLM) | Above budget |
| LLM call failure rate | > 0.5% |
| Cost per request | Above expected |
For LLMs, also log:
- Sample inputs/outputs (with PII redaction)
- Retrieved context (for RAG)
- Token counts
- Cost per request
This is where Langfuse / LangSmith / Phoenix / Helicone pay off. Connect to Observability Pipelines so the data flows like other telemetry.
GPU Cost
GPUs are the dominant cost line in serious ML/AI:
- Don't leave them idle: GPU at 10% utilization costs the same as 100%. Schedule training and serving with autoscaling and bin-packing.
- Spot for training: training tolerates interruption; use Spot/preemptible. Save checkpoints frequently.
- On-demand for serving: latency-sensitive serving = reserved or on-demand
- Smaller models: every drop in model size cuts GPU need linearly or more
- Quantization: FP16 → INT8 → INT4 is 2-4× throughput, often acceptable quality
- Mixed-precision training: trains as fast as FP16, same accuracy as FP32
- Multi-tenancy: serving multiple models on one GPU (Triton's model concurrency)
LLM API vs. self-host trade-off:
- API (OpenAI, Anthropic, Bedrock): zero infra, pay per token, scales infinitely, lock-in risk
- Self-host (vLLM, TGI on your GPUs): predictable cost at scale (~$0.10-0.50/M tokens vs OpenAI's $5-30), more work to operate
Crossover point varies, but above ~$10k/month in OpenAI bills, self-hosting often wins. Below, the API saves engineering time.
Connect to FinOps for ML cost attribution.
Safety and Guardrails
LLMs have unique safety concerns:
| Risk | Mitigation |
|---|---|
| Prompt injection | Treat user input as untrusted; sandbox tool calls; output validation |
| Hallucinations | RAG with citation; eval for factuality; user-facing "this may be wrong" |
| PII leakage | Don't put unnecessary PII in prompts; redact in logs |
| Toxic / unsafe output | Output filters (LlamaGuard, OpenAI Moderation); refuse list |
| Jailbreaks | System prompts that resist; output classifiers |
| Cost runaway | Rate limits per user/key/IP; max-tokens limits; budget alerts |
| Data exfiltration via tools | Restrict what tools can do; URL allowlists; sandbox code execution |
| Copyright / IP | Avoid verbatim reproduction; train on licensed data |
Tools: NeMo Guardrails, Guardrails AI, Lakera Guard, Protect AI.
Test adversarially. Red-team your LLM app before shipping. Standard prompt injection payloads are well known; if your app falls to "Ignore previous instructions and..." you have a problem.
Compliance and Governance
ML adds compliance shape:
- EU AI Act (in force 2025-2026): risk categorization, transparency, human oversight for high-risk
- GDPR: data subject rights apply to training data; "right to explanation" emerging
- Model cards: document intended use, training data, performance, limits
- Audit trail: who deployed which model, when, with what eval
- Bias and fairness: documented testing on protected groups
- Data lineage: which training rows produced this model
Internal Developer Platforms often surface model cards alongside service docs.
Productionization Checklist for a Model
A model going to production:
- Has a pipeline — not a notebook run by a human
- Has eval metrics — on holdout, on important slices
- Beats baseline by a meaningful margin
- Is shadow-tested or canaried before full rollout
- Is monitored — drift, performance, business metrics
- Has a rollback plan — flip back to previous version
- Has on-call ownership — someone is paged when it breaks
- Has cost projections — and budget alerts
- Has documentation — what it does, what it doesn't, limits
If a model lacks half of these, it's not production-ready, regardless of how good its accuracy is.
When to Fine-Tune vs RAG vs Prompt
A practical decision tree:
Do you have data that defines the desired behavior?
├── No → prompting + few-shot examples
└── Yes
├── Is the behavior pattern (style, format)?
│ ├── Yes → fine-tuning
│ └── No → maybe still fine-tuning, but try prompting first
└── Is the issue "model doesn't know this fact"?
├── Yes → RAG (retrieve facts from your data)
└── No → prompting
Should the model take actions in the world?
└── Yes → agents + tool use (with guardrails)Common path: prompting → prompting + RAG → fine-tuning + RAG. Each step costs more engineering but reduces inference cost and improves consistency.
Multi-Tenancy and Quotas
If your platform serves multiple internal teams or customers:
- Per-team API keys / namespaces
- Rate limits per key
- Token budgets per month
- Model whitelists per team (some teams not allowed to use the most expensive models)
- Showback for cost attribution
- Isolated inference for sensitive workloads (no shared KV cache across tenants)
Connect to API Gateway for enforcement.
Versioning Strategy
A working scheme:
- Code version: Git SHA
- Data version: snapshot ID or LakeFS commit
- Model version: registry version number
- Composite "release": tags
model-name:v1.2.3corresponds to a code + data + model triple
Production references aliases (prod, staging, canary), not specific versions. Aliases shift; redeploy isn't required to change model.
Build vs. Buy
When to build the platform vs. buy:
| Need | Build (OSS) | Buy (managed) |
|---|---|---|
| Experiment tracking | MLflow | W&B, Comet, Databricks |
| Training infra | Ray + K8s | SageMaker, Vertex |
| Serving | vLLM, BentoML | Bedrock, OpenAI, Replicate |
| Feature store | Feast | Tecton, Databricks FS |
| LLM observability | Langfuse, Phoenix | LangSmith, Helicone |
| Vector DB | pgvector, Qdrant | Pinecone, Weaviate Cloud |
| Eval | Promptfoo | Confident AI, Arize |
The right answer depends on team size, cost sensitivity, and operational appetite. Small teams should buy almost everything. Large teams may build the pieces that are core to differentiation, buy the rest.
Common Pitfalls
Notebook-as-production. Data scientist runs a notebook weekly to refresh predictions. No version control, no monitoring. Productionize the moment a model has users.
"We don't need MLOps yet". Said before the first model. After the first model breaks in production and nobody can debug it, the team buys MLOps fast. Build the loop before the first model, not after.
Skipping eval. Shipping LLM features by feel. Six months later, you change the model and have no way to know if it improved. Eval set early; grow it from production examples.
Training/serving skew. Featurized differently at training vs. serving — silent prediction bugs. Feature store or shared library; no copy-pasted feature code.
Auto-retraining without checks. Retraining nightly; one night the upstream data has a bug; model degrades; deploys; affects users. Always evaluate before promoting in automated pipelines.
LLM call in the hot path with no fallback. LLM API rate-limits; your feature is dead. Cache, fall back, degrade gracefully.
Prompt injection ignored. User input goes into the prompt. Adversarial user crafts payload. Treat as untrusted; validate output; constrain tool use.
Cost blindness. LLM bill goes from $1k to $50k/month nobody notices for weeks. Real-time cost alerts; per-feature attribution.
Building a "platform" for one team. Three engineers spend a year building Kubeflow + Feast + Triton + W&B for one product team. Pick the boring path; invest in platform when you have multiple teams.
Checklist
MLOps production readiness:
- All models have versioning (code + data + weights)
- Reproducible training (containerized, seeded, configs versioned)
- Eval set exists; CI fails if metrics regress on critical slices
- Shadow or canary deployment for new models
- Monitoring: latency, accuracy, drift, cost, errors
- Rollback procedure tested
- On-call ownership for each model in production
- Cost budget per model or feature; alerts on overrun
- LLM apps: prompt injection tests, output validation, PII redaction in logs
- RAG apps: eval set, retrieval quality measured, hybrid search considered
- Model cards / documentation for each production model
- GPU utilization tracked; bin-packing or autoscaling
- Training data versioned and lineage tracked
- Feature store or shared featurization to prevent skew
- Compliance: model registry audit trail, bias testing, data lineage
What's Next
You have an MLOps practice. Connect it to:
- Data Warehouses — training data lives here
- Workflow Orchestration — training and feature pipelines
- Vector Databases — RAG storage
- Object Storage — model artifacts, training data
- FinOps — GPU and LLM cost dominate; track them
- Observability Pipelines — LLM trace data flows like other telemetry
- Internal Developer Platforms — model catalog and self-serve deployment