Best Practices

The operational realities of running ML and AI in production without burning cash or shipping unsafe systems.

Reproducibility

The hardest discipline. Required for: rolling back, debugging, audit, compliance, future improvement.

Pin everything: Python deps (requirements.txt with hashes), CUDA version, OS image
Containerize training: use the same Dockerfile for dev and prod training
Seed deterministic code paths (torch.manual_seed, numpy.random.seed): CUDA non-determinism means runs may not be bit-for-bit identical, but seeding gets close
Version data: data snapshot ID logged with every training run
Track configs: every hyperparameter, every code SHA, every data version, in MLflow / W&B
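
As one way to picture the list above, the sketch below bundles the code SHA, data snapshot ID, hyperparameters, and seed into a single manifest with a stable fingerprint. The function names, field names, and sample values are hypothetical; in practice the manifest would be logged to MLflow or W&B rather than printed, and NumPy and PyTorch would be seeded alongside the standard library.

```python
import hashlib
import json
import random


def build_run_manifest(code_sha, data_snapshot_id, hyperparams, seed):
    """Collect everything needed to rebuild a training run into one record."""
    manifest = {
        "code_sha": code_sha,
        "data_snapshot_id": data_snapshot_id,
        "hyperparams": hyperparams,
        "seed": seed,
    }
    # Canonical JSON (sorted keys) so identical inputs always hash identically.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    manifest["fingerprint"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return manifest


def seed_everything(seed):
    """Seed the stdlib RNG only; this sketch has no third-party dependencies."""
    random.seed(seed)
    # Assumed additions when those libraries are installed:
    #   numpy.random.seed(seed); torch.manual_seed(seed)


manifest = build_run_manifest(
    code_sha="0123abc",               # hypothetical Git SHA
    data_snapshot_id="snap-2024-01",  # hypothetical snapshot ID
    hyperparams={"lr": 3e-4, "batch_size": 64},
    seed=42,
)
seed_everything(manifest["seed"])
print(json.dumps(manifest, indent=2))
```

Two runs with the same fingerprint were launched from the same code, data, and configuration, which makes the fingerprint a convenient key when searching old runs.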

Consider a six-month-old model with a customer-impacting bug: with reproducibility, you can rebuild the model, reproduce the bug, and find the cause. Without it, you guess.

Evaluation as a System

The eval pipeline matters as much as the training pipeline:

Hold out data the model never sees during training/tuning
Slice by important segments: country, customer tier, device type — overall accuracy is misleading
Behavioral tests: specific inputs that must produce specific outputs (regression suite)
Fairness checks: no protected group should be systematically worse
Robustness: how does the model handle adversarial / noisy input?

In CI: model build fails if metrics on critical slices regress. No "we'll check later" — automate.
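
A minimal sketch of such a CI gate, assuming metrics arrive as plain dictionaries keyed by slice name. The slice names, metric values, and the 0.01 tolerance are illustrative assumptions, not recommendations.

```python
def find_regressions(baseline, candidate, tolerance=0.01):
    """Return slices where the candidate metric drops more than `tolerance` below baseline."""
    regressions = {}
    for slice_name, base_value in baseline.items():
        # A slice missing from the candidate counts as a regression, not a pass.
        cand_value = candidate.get(slice_name)
        if cand_value is None or cand_value < base_value - tolerance:
            regressions[slice_name] = (base_value, cand_value)
    return regressions


# Hypothetical accuracy per slice for the current production model and a candidate.
baseline = {"overall": 0.91, "country=DE": 0.89, "tier=enterprise": 0.93}
candidate = {"overall": 0.92, "country=DE": 0.84, "tier=enterprise": 0.93}

failed = find_regressions(baseline, candidate)
for name, (base, cand) in failed.items():
    print(f"REGRESSION {name}: baseline={base} candidate={cand}")

# In CI, pass this value to sys.exit() so a regression fails the build.
exit_code = 1 if failed else 0
print("exit code:", exit_code)
```

Here the overall metric improves while one country slice regresses, which is the case that an overall-accuracy check alone would miss.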

For LLMs:

Build an eval set from real failures (production examples with desired/actual outputs)
LLM-as-judge with rubrics; calibrate the judge against human ratings
Track latency + cost as first-class metrics alongside quality
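
One common way to calibrate a judge against human ratings is to measure chance-corrected agreement. The sketch below uses Cohen's kappa, which is an assumption here since the text does not name a specific statistic; the labels are made-up examples.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Agreement between two label lists, corrected for chance agreement."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n)
        for label in set(human) | set(judge)
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of the same eight outputs by a human and by the LLM judge.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
judge = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]
print(f"kappa = {cohens_kappa(human, judge):.2f}")  # kappa = 0.50
```

A low kappa suggests the rubric or judge prompt needs work before the judge's scores are trusted in CI.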

Monitoring in Production

Beyond classical metrics:

Metric	Threshold to alert
Prediction latency P99	> 2× baseline
Error rate	> 1% sustained
Feature distribution drift (per feature)	PSI > 0.2
Prediction distribution drift	KL divergence > 0.1
Online accuracy (if labels available)	Below baseline - delta
Token usage (LLM)	Above budget
LLM call failure rate	> 0.5%
Cost per request	Above expected
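
For reference, the PSI threshold in the table can be computed as below. This is a minimal standard-library sketch: the bin edges and sample values are made up, and a real pipeline would usually derive the bins from the training distribution (for example, deciles).

```python
import bisect
import math


def bin_fractions(values, edges):
    """Fraction of values falling in each bin defined by consecutive edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        # bisect_right gives the bin index; clamp so out-of-range values land in the end bins.
        i = min(max(bisect.bisect_right(edges, v) - 1, 0), len(counts) - 1)
        counts[i] += 1
    return [c / len(values) for c in counts]


def psi(expected_fracs, actual_fracs, eps=1e-6):
    """Population Stability Index: sum over bins of (a - e) * ln(a / e)."""
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) for empty bins
        total += (a - e) * math.log(a / e)
    return total


edges = [0.0, 1.0, 2.0, 3.0]                        # hypothetical bin edges
training = [0.5] * 5 + [1.5] * 3 + [2.5] * 2        # reference distribution
serving = [0.5] * 2 + [1.5] * 3 + [2.5] * 5         # shifted toward high values

score = psi(bin_fractions(training, edges), bin_fractions(serving, edges))
print(f"PSI = {score:.3f}, alert = {score > 0.2}")
```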

For LLMs, also log:

Sample inputs/outputs (with PII redaction)
Retrieved context (for RAG)
Token counts
Cost per request

This is where Langfuse / LangSmith / Phoenix / Helicone pay off. Connect to Observability Pipelines so the data flows like other telemetry.
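
A rough sketch of what one logged LLM call might contain, covering the fields listed above. The prices, field names, and the email-only redaction are all assumptions for illustration; real PII redaction needs much broader coverage, and the tools named above define their own record schemas.

```python
import json
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Hypothetical prices in USD per million tokens; substitute your provider's rates.
PRICE_PER_M = {"input": 3.00, "output": 15.00}


def redact(text):
    """Mask email addresses only; this is a stand-in for a real PII redactor."""
    return EMAIL_RE.sub("[EMAIL]", text)


def build_log_record(prompt, completion, input_tokens, output_tokens, retrieved_ids):
    """Build one structured log entry for an LLM call."""
    cost = (
        input_tokens * PRICE_PER_M["input"] + output_tokens * PRICE_PER_M["output"]
    ) / 1_000_000
    return {
        "prompt": redact(prompt),
        "completion": redact(completion),
        "retrieved_context_ids": retrieved_ids,  # for RAG: which chunks were used
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": round(cost, 6),
    }


record = build_log_record(
    prompt="Summarize the ticket from jane.doe@example.com",
    completion="The customer reports a billing error.",
    input_tokens=1200,
    output_tokens=300,
    retrieved_ids=["doc-17", "doc-42"],  # hypothetical chunk IDs
)
print(json.dumps(record, indent=2))
```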

GPU Cost

GPUs are the dominant cost line in serious ML/AI:

Don't leave them idle: GPU at 10% utilization costs the same as 100%. Schedule training and serving with autoscaling and bin-packing.
Spot for training: training tolerates interruption; use Spot/preemptible. Save checkpoints frequently.
On-demand for serving: latency-sensitive serving belongs on reserved or on-demand capacity
Smaller models: every drop in model size cuts GPU need linearly or more
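Quantization: moving from FP16 to INT8 or INT4 typically gives 2-4× throughput, often at acceptable quality
Mixed-precision training: runs most operations in FP16/BF16 for speed while keeping FP32 master weights, so accuracy usually stays close to full FP32 training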
Multi-tenancy: serving multiple models on one GPU (Triton's model concurrency)
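
The Spot advice above depends on training being resumable. Below is a minimal sketch of checkpoint-and-resume with a simulated interruption; the JSON checkpoint, the `Preempted` exception, and the fake training step are stand-ins, and a real job would save model and optimizer state to durable storage instead.

```python
import json
import os
import tempfile


class Preempted(Exception):
    """Stand-in for a Spot/preemptible instance being reclaimed."""


def save_checkpoint(path, state):
    """Write atomically so an interruption mid-write cannot corrupt the checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every, interrupt_at=None):
    """Resume from the last checkpoint and run until total_steps."""
    state = load_checkpoint(path)
    for step in range(state["step"] + 1, total_steps + 1):
        if interrupt_at is not None and step == interrupt_at:
            raise Preempted(f"reclaimed at step {step}")
        state = {"step": step, "loss": 1.0 / step}  # stand-in for a real training step
        if step % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "ckpt.json")
    try:
        train(ckpt, total_steps=50, checkpoint_every=10, interrupt_at=25)
    except Preempted:
        resumed_from = load_checkpoint(ckpt)["step"]
    final = train(ckpt, total_steps=50, checkpoint_every=10)
    print(resumed_from, final["step"])  # 20 50
```

The work lost to an interruption is bounded by the checkpoint interval, so the interval is a trade-off between checkpoint overhead and recomputation.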

LLM API vs. self-host trade-off:

API (OpenAI, Anthropic, Bedrock): zero infra, pay per token, scales without capacity planning (within provider rate limits), lock-in risk
Self-host (vLLM, TGI on your GPUs): predictable cost at scale (~$0.10-0.50 per million tokens vs. OpenAI's $5-30 per million), more work to operate

Crossover point varies, but above ~$10k/month in OpenAI bills, self-hosting often wins. Below, the API saves engineering time.
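
The crossover can be estimated with simple arithmetic. In the sketch below the per-token prices are picked from the ranges quoted above, while the $8,000/month fixed cost (GPUs plus operational effort) is purely an assumed figure; plug in your own numbers.

```python
def monthly_api_cost(tokens_millions, price_per_m):
    """API cost: pay per token, no fixed component."""
    return tokens_millions * price_per_m


def monthly_self_host_cost(tokens_millions, fixed_cost, marginal_per_m):
    """Self-host cost: a fixed monthly base plus a small per-token cost."""
    return fixed_cost + tokens_millions * marginal_per_m


def break_even_tokens_millions(price_per_m, fixed_cost, marginal_per_m):
    """Monthly volume (millions of tokens) above which self-hosting is cheaper."""
    if price_per_m <= marginal_per_m:
        return None  # the API is never more expensive per token; no crossover
    return fixed_cost / (price_per_m - marginal_per_m)


API_PRICE = 10.00        # USD per million tokens, within the quoted $5-30 range
SELF_HOST_PRICE = 0.30   # USD per million tokens, within the quoted $0.10-0.50 range
FIXED_COST = 8_000.00    # USD per month, an assumed figure

volume = break_even_tokens_millions(API_PRICE, FIXED_COST, SELF_HOST_PRICE)
print(f"break-even volume: {volume:.0f}M tokens/month")
print(f"API bill at break-even: ${monthly_api_cost(volume, API_PRICE):,.0f}")
```

Under these assumptions the break-even API bill lands a little above $8k/month, in the same range as the rule of thumb above. The result is very sensitive to the fixed-cost estimate.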

Connect to FinOps for ML cost attribution.

Safety and Guardrails

LLMs have unique safety concerns:

Risk	Mitigation
Prompt injection	Treat user input as untrusted; sandbox tool calls; output validation
Hallucinations	RAG with citation; eval for factuality; user-facing "this may be wrong"
PII leakage	Don't put unnecessary PII in prompts; redact in logs
Toxic / unsafe output	Output filters (LlamaGuard, OpenAI Moderation); refuse list
Jailbreaks	System prompts that resist; output classifiers
Cost runaway	Rate limits per user/key/IP; max-tokens limits; budget alerts
Data exfiltration via tools	Restrict what tools can do; URL allowlists; sandbox code execution
Copyright / IP	Avoid verbatim reproduction; train on licensed data

Tools: NeMo Guardrails, Guardrails AI, Lakera Guard, Protect AI.

Test adversarially. Red-team your LLM app before shipping. Standard prompt injection payloads are well known; if your app falls to "Ignore previous instructions and..." you have a problem.
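
As an illustration of the "URL allowlists" mitigation from the table, the sketch below checks a URL before a tool is allowed to fetch it. The hostnames are hypothetical, and this covers only one layer: it does not address redirects or DNS rebinding, which need their own handling.

```python
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"api.internal.example.com", "docs.example.com"}  # hypothetical


def is_url_allowed(url, allowed_hosts=ALLOWED_HOSTS):
    """Allow only https URLs whose exact hostname is on the allowlist."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL
    if parts.scheme != "https":
        return False
    # .hostname drops userinfo and port and lowercases the host, which defeats
    # lookalikes such as https://docs.example.com@evil.test/
    return parts.hostname in allowed_hosts


for candidate_url in [
    "https://docs.example.com/guide",
    "http://docs.example.com/guide",
    "https://docs.example.com@evil.test/",
    "https://docs.example.com.evil.test/",
]:
    print(candidate_url, "->", is_url_allowed(candidate_url))
```

Matching the exact hostname, rather than checking whether the URL string contains an allowed domain, is the design choice that matters here.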

Compliance and Governance

ML adds its own compliance obligations:

EU AI Act (in force since August 2024; obligations phase in from 2025 through 2027): risk categorization, transparency, human oversight for high-risk systems
GDPR: data subject rights apply to personal data used for training; a "right to explanation" for automated decisions is emerging
Model cards: document intended use, training data, performance, limits
Audit trail: who deployed which model, when, with what eval
Bias and fairness: documented testing on protected groups
Data lineage: which training rows produced this model

Internal Developer Platforms often surface model cards alongside service docs.

Productionization Checklist for a Model

A model going to production:

Has a pipeline — not a notebook run by a human
Has eval metrics — on holdout, on important slices
Beats baseline by a meaningful margin
Is shadow-tested or canaried before full rollout
Is monitored — drift, performance, business metrics
Has a rollback plan — flip back to previous version
Has on-call ownership — someone is paged when it breaks
Has cost projections — and budget alerts
Has documentation — what it does, what it doesn't, limits

If a model lacks half of these, it's not production-ready, regardless of how good its accuracy is.

When to Fine-Tune vs RAG vs Prompt

A practical decision tree:

Do you have data that defines the desired behavior?
├── No → prompting + few-shot examples
└── Yes
    ├── Is the behavior a pattern (style, format)?
    │   ├── Yes → fine-tuning
    │   └── No → maybe still fine-tuning, but try prompting first
    └── Is the issue "model doesn't know this fact"?
        ├── Yes → RAG (retrieve facts from your data)
        └── No → prompting

Should the model take actions in the world?
└── Yes → agents + tool use (with guardrails)

Common path: prompting → prompting + RAG → fine-tuning + RAG. Each step costs more engineering in exchange for better consistency; fine-tuning can also lower inference cost by shortening prompts.

Multi-Tenancy and Quotas

If your platform serves multiple internal teams or customers:

Per-team API keys / namespaces
Rate limits per key
Token budgets per month
Model allowlists per team (some teams are not allowed to use the most expensive models)
Showback for cost attribution
Isolated inference for sensitive workloads (no shared KV cache across tenants)

Connect to API Gateway for enforcement.
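
To make the rate-limit and token-budget items concrete, here is a minimal in-memory sketch combining a token-bucket rate limiter with a monthly LLM token budget per tenant. Class names and limits are hypothetical; a gateway would normally enforce this with shared state (for example, a central store) rather than per-process objects.

```python
import time


class RateBucket:
    """Token-bucket limiter: `rate` requests/second, bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.permits = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.permits = min(self.capacity, self.permits + (now - self.updated) * self.rate)
        self.updated = now
        if self.permits >= 1.0:
            self.permits -= 1.0
            return True
        return False


class TenantLimits:
    """Per-tenant request rate limit plus a monthly LLM token budget."""

    def __init__(self, rate, burst, monthly_token_budget, clock=time.monotonic):
        self.bucket = RateBucket(rate, burst, clock)
        self.remaining_budget = monthly_token_budget

    def admit(self, estimated_tokens):
        if estimated_tokens > self.remaining_budget:
            return False, "monthly token budget exhausted"
        if not self.bucket.allow():
            return False, "rate limit exceeded"
        self.remaining_budget -= estimated_tokens
        return True, "ok"


team_a = TenantLimits(rate=5.0, burst=10, monthly_token_budget=2_000_000)
print(team_a.admit(estimated_tokens=1_500))
```

The injectable `clock` argument exists so the limiter can be tested deterministically without sleeping.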

Versioning Strategy

A working scheme:

Code version: Git SHA
Data version: snapshot ID or LakeFS commit
Model version: registry version number
Composite "release": a tag such as model-name:v1.2.3 corresponds to a code + data + model triple

Production references aliases (prod, staging, canary), not specific versions. Moving an alias to a different version changes the served model without a redeploy.
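
A toy in-memory sketch of the scheme, showing a release as a code + data + model triple and aliases as movable pointers. The `ModelRegistry` class and all version strings are hypothetical; real registries such as MLflow expose their own alias APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """One immutable release: model version tied to its code and data versions."""
    model_version: str
    code_sha: str
    data_version: str


class ModelRegistry:
    def __init__(self):
        self._releases = {}
        self._aliases = {}

    def register(self, release):
        self._releases[release.model_version] = release

    def set_alias(self, alias, model_version):
        """Point an alias at a registered version; serving code only knows the alias."""
        if model_version not in self._releases:
            raise KeyError(f"unknown model version: {model_version}")
        self._aliases[alias] = model_version

    def resolve(self, alias):
        return self._releases[self._aliases[alias]]


registry = ModelRegistry()
registry.register(Release("v1.2.3", code_sha="0123abc", data_version="snap-41"))
registry.register(Release("v1.2.4", code_sha="4567def", data_version="snap-42"))

registry.set_alias("prod", "v1.2.3")
registry.set_alias("canary", "v1.2.4")
registry.set_alias("prod", "v1.2.4")  # promote the canary
registry.set_alias("prod", "v1.2.3")  # roll back: one pointer move, no rebuild
print(registry.resolve("prod"))
```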

Build vs. Buy

When to build the platform vs. buy:

Need	Build (OSS)	Buy (managed)
Experiment tracking	MLflow	W&B, Comet, Databricks
Training infra	Ray + K8s	SageMaker, Vertex
Serving	vLLM, BentoML	Bedrock, OpenAI, Replicate
Feature store	Feast	Tecton, Databricks FS
LLM observability	Langfuse, Phoenix	LangSmith, Helicone
Vector DB	pgvector, Qdrant	Pinecone, Weaviate Cloud
Eval	Promptfoo	Confident AI, Arize

The right answer depends on team size, cost sensitivity, and operational appetite. Small teams should buy almost everything. Large teams may build the pieces that are core to differentiation and buy the rest.

Common Pitfalls

Notebook-as-production. Data scientist runs a notebook weekly to refresh predictions. No version control, no monitoring. Productionize the moment a model has users.

"We don't need MLOps yet". Said before the first model. After the first model breaks in production and nobody can debug it, the team buys MLOps fast. Build the loop before the first model, not after.

Skipping eval. Shipping LLM features by feel. Six months later, you change the model and have no way to know if it improved. Build an eval set early; grow it from production examples.

Training/serving skew. Features computed differently at training vs. serving time cause silent prediction bugs. Use a feature store or a shared library; no copy-pasted feature code.

Auto-retraining without checks. Retraining nightly; one night the upstream data has a bug; model degrades; deploys; affects users. Always evaluate before promoting in automated pipelines.

LLM call in the hot path with no fallback. LLM API rate-limits; your feature is dead. Cache, fall back, degrade gracefully.
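
A minimal sketch of the cache-then-fallback pattern, assuming the LLM client is passed in as a plain callable. The `flaky_llm` function and the canned fallback text are hypothetical; production code would catch the provider's specific error types and use a bounded cache with expiry.

```python
def answer_with_fallback(prompt, call_llm, cache, fallback_text):
    """Try the cache, then the LLM; on failure return a degraded but valid response."""
    if prompt in cache:
        return cache[prompt], "cache"
    try:
        result = call_llm(prompt)
    except Exception:
        # Rate limit, timeout, or provider outage: degrade rather than fail the feature.
        # Broad catch for brevity; narrow this to your client's error types.
        return fallback_text, "fallback"
    cache[prompt] = result
    return result, "llm"


def flaky_llm(prompt):
    """Hypothetical client that always fails, to exercise the fallback path."""
    raise TimeoutError("simulated provider timeout")


cache = {"What is PSI?": "Population Stability Index."}
print(answer_with_fallback("What is PSI?", flaky_llm, cache, "Try again later."))
print(answer_with_fallback("New question", flaky_llm, cache, "Try again later."))
```

Returning the source label alongside the text lets the caller log how often the feature is running degraded.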

Prompt injection ignored. User input goes into the prompt. Adversarial user crafts payload. Treat as untrusted; validate output; constrain tool use.

Cost blindness. The LLM bill goes from $1k to $50k/month and nobody notices for weeks. Set up real-time cost alerts and per-feature attribution.

Building a "platform" for one team. Three engineers spend a year building Kubeflow + Feast + Triton + W&B for one product team. Pick the boring path; invest in platform when you have multiple teams.

What's Next

You have an MLOps practice. Connect it to:

Data Warehouses — training data lives here
Workflow Orchestration — training and feature pipelines
Vector Databases — RAG storage
Object Storage — model artifacts, training data
FinOps — GPU and LLM cost dominate; track them
Observability Pipelines — LLM trace data flows like other telemetry
Internal Developer Platforms — model catalog and self-serve deployment
