Steven's Knowledge

Best Practices

Reproducibility, evaluation, monitoring, GPU cost, LLM safety, common pitfalls, when to fine-tune vs RAG vs prompt

The operational realities of running ML and AI in production without burning cash or shipping unsafe systems.

Reproducibility

The hardest discipline. Required for: rolling back, debugging, audit, compliance, future improvement.

  • Pin everything: Python deps (requirements.txt with hashes), CUDA version, OS image
  • Containerize training: same Dockerfile in dev and prod training
  • Seed deterministic code paths (torch.manual_seed, numpy.random.seed) — even if not 100% reproducible due to CUDA non-determinism, it gets close
  • Version data: data snapshot ID logged with every training run
  • Track configs: every hyperparameter, every code SHA, every data version, in MLflow / W&B
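
A minimal sketch of the seeding step, assuming a helper named seed_everything (a hypothetical name, not part of any library). NumPy and PyTorch are seeded only if they are installed, and the deterministic-algorithms call assumes a reasonably recent PyTorch release.

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Seed every RNG that is available in this environment."""
    random.seed(seed)
    # Only affects subprocesses started after this point; set it in the
    # container environment as well to cover the current interpreter.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # seeds the CPU and CUDA generators
        # Prefer deterministic kernels; warn rather than fail where none exists.
        torch.use_deterministic_algorithms(True, warn_only=True)
    except ImportError:
        pass


seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
print(first == second)
```

Log the seed with the run alongside the code SHA and data snapshot ID, so the same value can be replayed later.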

Consider a six-month-old model with a customer-impacting bug. With reproducibility, you can rebuild the model, reproduce the bug, and find the cause. Without it, you guess.

Evaluation as a System

The eval pipeline matters as much as the training pipeline:

  • Hold out data the model never sees during training/tuning
  • Slice by important segments: country, customer tier, device type — overall accuracy is misleading
  • Behavioral tests: specific inputs that must produce specific outputs (regression suite)
  • Fairness checks: no protected group should be systematically worse
  • Robustness: how does the model handle adversarial / noisy input?

In CI: model build fails if metrics on critical slices regress. No "we'll check later" — automate.
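
One way to sketch that CI gate, assuming metrics arrive as plain dicts keyed by slice name. The slice names, scores, and the 0.01 tolerance are illustrative, and find_regressions is a hypothetical helper.

```python
def find_regressions(baseline, candidate, tolerance=0.01):
    """Return {slice: (baseline_score, candidate_score)} for regressed slices.

    A slice that is missing from the candidate metrics counts as a regression.
    """
    failures = {}
    for slice_name, base_score in baseline.items():
        new_score = candidate.get(slice_name)
        if new_score is None or new_score < base_score - tolerance:
            failures[slice_name] = (base_score, new_score)
    return failures


baseline = {"overall": 0.91, "country=DE": 0.88, "tier=enterprise": 0.93}
candidate = {"overall": 0.92, "country=DE": 0.84, "tier=enterprise": 0.93}

regressions = find_regressions(baseline, candidate)
for name, (old, new) in regressions.items():
    print(f"REGRESSION {name}: {old} -> {new}")

# A CI entry point would finish with sys.exit(exit_code).
exit_code = 1 if regressions else 0
```

Here overall accuracy improves while the country=DE slice regresses, which is exactly the case an overall-only check would miss.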

For LLMs:

  • Build an eval set from real failures (production examples with desired/actual outputs)
  • LLM-as-judge with rubrics; calibrate the judge against human ratings
  • Track latency + cost as first-class metrics alongside quality
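
Calibrating the judge can be as simple as measuring chance-corrected agreement with human labels on a shared sample. The sketch below uses Cohen's kappa as one possible agreement measure (the text does not prescribe one), and the labels are made up for illustration.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_freq, judge_freq = Counter(human), Counter(judge)
    # Agreement expected if both raters labeled at random with their own rates.
    expected = sum(human_freq[label] * judge_freq[label] for label in human_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]
kappa = cohens_kappa(human, judge)
print(round(kappa, 2))
```

A low kappa suggests the rubric or the judge prompt needs work before the judge's scores can gate anything.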

Monitoring in Production

Beyond classical metrics:

| Metric | Threshold to alert |
| --- | --- |
| Prediction latency P99 | > 2× baseline |
| Error rate | > 1% sustained |
| Feature distribution drift (per feature) | PSI > 0.2 |
| Prediction distribution drift | KL divergence > 0.1 |
| Online accuracy (if labels available) | Below baseline - delta |
| Token usage (LLM) | Above budget |
| LLM call failure rate | > 0.5% |
| Cost per request | Above expected |
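
The two drift metrics can be computed from binned counts. This is a minimal sketch assuming both windows are already bucketed with the same bin edges; the counts are illustrative and the epsilon floor for empty bins is an assumption.

```python
import math


def _proportions(counts, eps=1e-6):
    """Convert bin counts to proportions, flooring at eps to avoid log(0)."""
    total = sum(counts)
    return [max(c / total, eps) for c in counts]


def psi(expected_counts, actual_counts):
    """Population Stability Index between a baseline and a live window."""
    e, a = _proportions(expected_counts), _proportions(actual_counts)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def kl_divergence(p_counts, q_counts):
    """KL(P || Q) in nats, with P the live window and Q the baseline."""
    p, q = _proportions(p_counts), _proportions(q_counts)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


training = [100, 300, 400, 200]  # baseline bin counts for one feature
stable = [110, 290, 390, 210]    # live window, small natural variation
shifted = [300, 400, 200, 100]   # live window after an upstream change

print(psi(training, stable) > 0.2)             # alert?
print(psi(training, shifted) > 0.2)            # alert?
print(kl_divergence(shifted, training) > 0.1)  # alert?
```

Both functions work on counts rather than raw values, so the same code applies to a feature column or to binned prediction scores.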

For LLMs, also log:

  • Sample inputs/outputs (with PII redaction)
  • Retrieved context (for RAG)
  • Token counts
  • Cost per request

This is where Langfuse / LangSmith / Phoenix / Helicone pay off. Connect to Observability Pipelines so the data flows like other telemetry.
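
If you assemble these records yourself before handing them to one of those tools, a sketch might look like the following. The regexes are deliberately simple and will miss many PII forms, the single flat token price is a simplification, and build_log_record is a hypothetical helper.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Mask emails and phone-like digit runs. Illustrative, not exhaustive."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))


def build_log_record(prompt, completion, retrieved, prompt_tokens,
                     completion_tokens, usd_per_million_tokens):
    """Assemble one redacted log record with token counts and cost."""
    total_tokens = prompt_tokens + completion_tokens
    return {
        "input": redact(prompt),
        "output": redact(completion),
        "retrieved_context": [redact(chunk) for chunk in retrieved],
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cost_usd": total_tokens * usd_per_million_tokens / 1_000_000,
    }


record = build_log_record(
    prompt="Contact jane@example.com or +1 (555) 010-9999 about order 42",
    completion="I emailed jane@example.com.",
    retrieved=["Order 42 shipped on Monday."],
    prompt_tokens=1200,
    completion_tokens=300,
    usd_per_million_tokens=10.0,  # assumed price, for illustration only
)
print(record["input"])
```

Redacting before the record leaves the process keeps raw PII out of every downstream sink, not only the one you remembered to configure.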

GPU Cost

GPUs are the dominant cost line in serious ML/AI:

  • Don't leave them idle: GPU at 10% utilization costs the same as 100%. Schedule training and serving with autoscaling and bin-packing.
  • Spot for training: training tolerates interruption; use Spot/preemptible. Save checkpoints frequently.
  • On-demand for serving: latency-sensitive serving = reserved or on-demand
  • Smaller models: every drop in model size cuts GPU need linearly or more
  • Quantization: FP16 → INT8 → INT4 can give roughly 2-4× throughput, often at acceptable quality; confirm on your eval set
  • Mixed-precision training: close to FP16 speed with accuracy typically on par with FP32
  • Multi-tenancy: serving multiple models on one GPU (Triton's model concurrency)
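
Spot training only works if an interruption costs minutes, not the whole run. The sketch below shows the checkpoint-and-resume pattern with a counter standing in for real training state; the JSON file, the step counts, and the stop_after parameter (which simulates an interruption) are assumptions for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write atomically so an interruption mid-write never corrupts the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every, stop_after=None):
    """Resume from the last checkpoint; stop_after simulates a Spot interruption."""
    state = load_checkpoint(path)
    steps_run = 0
    while state["step"] < total_steps:
        state["step"] += 1  # stand-in for one real training step
        steps_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
        if stop_after is not None and steps_run >= stop_after:
            return state["step"]
    save_checkpoint(path, state)
    return state["step"]


workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "ckpt.json")
reached = train(ckpt, total_steps=100, checkpoint_every=10, stop_after=37)
resumed_from = load_checkpoint(ckpt)["step"]
final = train(ckpt, total_steps=100, checkpoint_every=10)
print(reached, resumed_from, final)
```

The gap between where the run was interrupted and where it resumes is the work you pay for twice, which is the trade-off behind "save checkpoints frequently".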

LLM API vs. self-host trade-off:

  • API (OpenAI, Anthropic, Bedrock): zero infra, pay per token, scales on demand within provider rate limits, lock-in risk
  • Self-host (vLLM, TGI on your GPUs): predictable cost at scale (roughly $0.10-0.50 per million tokens vs. OpenAI's $5-30 per million), more work to operate

Crossover point varies, but above ~$10k/month in OpenAI bills, self-hosting often wins. Below, the API saves engineering time.
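
The crossover is simple arithmetic once you separate the fixed cost of self-hosting from the per-token cost. All numbers below are assumptions for illustration: $10 per million tokens for the API, $0.30 per million marginal cost when self-hosting, and $8,000/month of fixed cost (reserved GPUs plus engineering time).

```python
def monthly_api_cost(tokens_millions, usd_per_million):
    """API bill: purely proportional to volume."""
    return tokens_millions * usd_per_million


def monthly_self_host_cost(tokens_millions, usd_per_million, fixed_usd):
    """Self-hosting: fixed monthly cost plus a small marginal cost."""
    return fixed_usd + tokens_millions * usd_per_million


def break_even_tokens_millions(api_usd_per_million, self_usd_per_million, fixed_usd):
    """Monthly volume (in millions of tokens) where both options cost the same."""
    if api_usd_per_million <= self_usd_per_million:
        raise ValueError("self-hosting never breaks even at these prices")
    return fixed_usd / (api_usd_per_million - self_usd_per_million)


API_PRICE, SELF_PRICE, FIXED = 10.0, 0.30, 8000.0  # assumed values

volume = break_even_tokens_millions(API_PRICE, SELF_PRICE, FIXED)
api_bill_at_break_even = monthly_api_cost(volume, API_PRICE)
print(f"break-even at {volume:.0f}M tokens/month, API bill ${api_bill_at_break_even:,.0f}")
```

Under these assumptions the break-even API bill lands a little above $8k/month; plug in your own prices and staffing cost, since the fixed term dominates the result.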

Connect to FinOps for ML cost attribution.

Safety and Guardrails

LLMs have unique safety concerns:

| Risk | Mitigation |
| --- | --- |
| Prompt injection | Treat user input as untrusted; sandbox tool calls; output validation |
| Hallucinations | RAG with citation; eval for factuality; user-facing "this may be wrong" |
| PII leakage | Don't put unnecessary PII in prompts; redact in logs |
| Toxic / unsafe output | Output filters (LlamaGuard, OpenAI Moderation); refuse list |
| Jailbreaks | System prompts that resist; output classifiers |
| Cost runaway | Rate limits per user/key/IP; max-tokens limits; budget alerts |
| Data exfiltration via tools | Restrict what tools can do; URL allowlists; sandbox code execution |
| Copyright / IP | Avoid verbatim reproduction; train on licensed data |

Tools: NeMo Guardrails, Guardrails AI, Lakera Guard, Protect AI.
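
As a small example of restricting what tools can do, here is a sketch of a URL allowlist check to run before any tool fetches a model-supplied URL. The host names are hypothetical, and is_allowed_url is an illustrative helper rather than part of any of the tools listed above.

```python
from urllib.parse import urlsplit

# Hypothetical hosts the tool is permitted to call.
ALLOWED_HOSTS = frozenset({"api.internal.example.com", "docs.example.com"})


def is_allowed_url(url, allowed_hosts=ALLOWED_HOSTS):
    """Allow only https URLs whose exact host is on the allowlist."""
    try:
        parts = urlsplit(url)
        host = parts.hostname  # lowercased, without userinfo or port
    except ValueError:
        return False
    if parts.scheme != "https":
        return False
    return host in allowed_hosts


candidates = [
    "https://docs.example.com/guide",
    "https://DOCS.example.com:443/guide",
    "http://docs.example.com/guide",
    "https://docs.example.com.evil.test/",
    "https://docs.example.com@evil.test/",
]
results = [is_allowed_url(u) for u in candidates]
print(results)
```

Matching on the parsed host rather than on a string prefix is the point: the last two URLs start with an allowed name but resolve to a different host.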

Test adversarially. Red-team your LLM app before shipping. Standard prompt injection payloads are well known; if your app falls to "Ignore previous instructions and..." you have a problem.

Compliance and Governance

ML adds its own compliance obligations:

  • EU AI Act (in force since August 2024; obligations phase in from 2025 through 2027): risk categorization, transparency, human oversight for high-risk systems
  • GDPR: data subject rights apply to training data; "right to explanation" emerging
  • Model cards: document intended use, training data, performance, limits
  • Audit trail: who deployed which model, when, with what eval
  • Bias and fairness: documented testing on protected groups
  • Data lineage: which training rows produced this model

Internal Developer Platforms often surface model cards alongside service docs.

Productionization Checklist for a Model

A model going to production:

  1. Has a pipeline — not a notebook run by a human
  2. Has eval metrics — on holdout, on important slices
  3. Beats baseline by a meaningful margin
  4. Is shadow-tested or canaried before full rollout
  5. Is monitored — drift, performance, business metrics
  6. Has a rollback plan — flip back to previous version
  7. Has on-call ownership — someone is paged when it breaks
  8. Has cost projections — and budget alerts
  9. Has documentation — what it does, what it doesn't, limits

If a model lacks half of these, it's not production-ready, regardless of how good its accuracy is.

When to Fine-Tune vs RAG vs Prompt

A practical decision tree:

Do you have data that defines the desired behavior?
├── No → prompting + few-shot examples
└── Yes
    ├── Is the behavior a pattern (style, format)?
    │   ├── Yes → fine-tuning
    │   └── No → maybe still fine-tuning, but try prompting first
    └── Is the issue "model doesn't know this fact"?
        ├── Yes → RAG (retrieve facts from your data)
        └── No → prompting

Should the model take actions in the world?
└── Yes → agents + tool use (with guardrails)

Common path: prompting → prompting + RAG → fine-tuning + RAG. Each step costs more engineering but improves consistency; fine-tuning can also reduce inference cost by allowing shorter prompts or a smaller model.

Multi-Tenancy and Quotas

If your platform serves multiple internal teams or customers:

  • Per-team API keys / namespaces
  • Rate limits per key
  • Token budgets per month
  • Model allowlists per team (some teams are not allowed to use the most expensive models)
  • Showback for cost attribution
  • Isolated inference for sensitive workloads (no shared KV cache across tenants)

Connect to API Gateway for enforcement.
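
A gateway would normally enforce these, but the logic is small enough to sketch. The class below combines a token-bucket rate limit, a monthly token budget, and a per-team model list; TenantQuota, the model names, and the limits are all hypothetical, and the monthly budget reset is left out.

```python
import time


class TenantQuota:
    """Per-key rate limit (token bucket), token budget, and model list."""

    def __init__(self, requests_per_second, burst, monthly_token_budget, allowed_models):
        self.requests_per_second = requests_per_second
        self.burst = burst
        self.monthly_token_budget = monthly_token_budget
        self.allowed_models = frozenset(allowed_models)
        self.tokens_used = 0
        self._bucket = float(burst)
        self._last_refill = None

    def admit(self, model, tokens, now=None):
        """Return (allowed, reason) and record usage when allowed."""
        now = time.monotonic() if now is None else now
        if model not in self.allowed_models:
            return False, "model not allowed"
        if self.tokens_used + tokens > self.monthly_token_budget:
            return False, "token budget exceeded"
        if self._last_refill is not None:
            # Refill the bucket in proportion to elapsed time, capped at burst.
            elapsed = max(0.0, now - self._last_refill)
            refilled = self._bucket + elapsed * self.requests_per_second
            self._bucket = min(float(self.burst), refilled)
        self._last_refill = now
        if self._bucket < 1.0:
            return False, "rate limited"
        self._bucket -= 1.0
        self.tokens_used += tokens
        return True, "ok"


team = TenantQuota(requests_per_second=1.0, burst=2,
                   monthly_token_budget=1000, allowed_models={"small-model"})
decisions = [
    team.admit("small-model", 100, now=0.0),
    team.admit("small-model", 100, now=0.0),
    team.admit("small-model", 100, now=0.0),  # burst of 2 is exhausted
    team.admit("small-model", 100, now=1.0),  # one request refilled
    team.admit("large-model", 100, now=5.0),
    team.admit("small-model", 800, now=5.0),
]
print([reason for _, reason in decisions])
```

Passing the clock in as a parameter keeps the limiter testable; in production the default monotonic clock is used and state would live in a shared store rather than in process memory.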

Versioning Strategy

A working scheme:

  • Code version: Git SHA
  • Data version: snapshot ID or LakeFS commit
  • Model version: registry version number
  • Composite "release": a tag such as model-name:v1.2.3 identifies a code + data + model triple

Production references aliases (prod, staging, canary), not specific versions. Moving an alias changes the model being served without a redeploy.
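
The scheme can be sketched as a tiny in-memory registry. Real registries such as MLflow provide their own alias mechanisms; the Release and AliasRegistry classes, SHAs, and snapshot IDs here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """One immutable code + data + model triple behind a release tag."""
    tag: str
    code_sha: str
    data_snapshot: str
    model_version: int


class AliasRegistry:
    def __init__(self):
        self._releases = {}
        self._aliases = {}

    def register(self, release):
        self._releases[release.tag] = release

    def point(self, alias, tag):
        """Move an alias to an existing release; no redeploy involved."""
        if tag not in self._releases:
            raise KeyError(f"unknown release tag: {tag}")
        self._aliases[alias] = tag

    def resolve(self, alias):
        return self._releases[self._aliases[alias]]


registry = AliasRegistry()
registry.register(Release("churn-model:v1.2.3", "a1b2c3d", "snap-0041", 7))
registry.register(Release("churn-model:v1.3.0", "e4f5a6b", "snap-0047", 8))

registry.point("prod", "churn-model:v1.2.3")
registry.point("canary", "churn-model:v1.3.0")
before = registry.resolve("prod").model_version

registry.point("prod", "churn-model:v1.3.0")  # promote the canary
after = registry.resolve("prod").model_version

registry.point("prod", "churn-model:v1.2.3")  # roll back
rolled_back = registry.resolve("prod").model_version
print(before, after, rolled_back)
```

Because serving code only ever asks for "prod", promotion and rollback are the same one-line operation in opposite directions.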

Build vs. Buy

When to build the platform vs. buy:

| Need | Build (OSS) | Buy (managed) |
| --- | --- | --- |
| Experiment tracking | MLflow | W&B, Comet, Databricks |
| Training infra | Ray + K8s | SageMaker, Vertex |
| Serving | vLLM, BentoML | Bedrock, OpenAI, Replicate |
| Feature store | Feast | Tecton, Databricks FS |
| LLM observability | Langfuse, Phoenix | LangSmith, Helicone |
| Vector DB | pgvector, Qdrant | Pinecone, Weaviate Cloud |
| Eval | Promptfoo | Confident AI, Arize |

The right answer depends on team size, cost sensitivity, and operational appetite. Small teams should buy almost everything. Large teams may build the pieces that are core to differentiation and buy the rest.

Common Pitfalls

Notebook-as-production. Data scientist runs a notebook weekly to refresh predictions. No version control, no monitoring. Productionize the moment a model has users.

"We don't need MLOps yet". Said before the first model. After the first model breaks in production and nobody can debug it, the team buys MLOps fast. Build the loop before the first model, not after.

Skipping eval. Shipping LLM features by feel. Six months later, you change the model and have no way to know whether it improved. Build an eval set early; grow it from production examples.

Training/serving skew. Featurized differently at training vs. serving — silent prediction bugs. Feature store or shared library; no copy-pasted feature code.

Auto-retraining without checks. Retraining nightly; one night the upstream data has a bug; model degrades; deploys; affects users. Always evaluate before promoting in automated pipelines.

LLM call in the hot path with no fallback. LLM API rate-limits; your feature is dead. Cache, fall back, degrade gracefully.
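
A minimal sketch of the cache-then-fallback pattern, assuming the model call is any function from prompt to text. The stub functions and canned message are hypothetical, and real code should catch the provider's specific rate-limit and timeout errors rather than a bare Exception.

```python
def answer_with_fallback(prompt, primary, fallback, cache):
    """Return (text, source), where source is 'cache', 'primary', or 'fallback'."""
    if prompt in cache:
        return cache[prompt], "cache"
    try:
        result = primary(prompt)
    except Exception:  # narrow this to provider errors in real code
        return fallback(prompt), "fallback"
    cache[prompt] = result  # only successful primary answers are cached
    return result, "primary"


def healthy_llm(prompt):
    """Stub for a working LLM call."""
    return f"summary of: {prompt}"


def rate_limited_llm(prompt):
    """Stub for an LLM call that is being rate-limited."""
    raise RuntimeError("429 Too Many Requests")


def canned_fallback(prompt):
    """Degraded response that keeps the feature usable."""
    return "Summaries are temporarily unavailable."


cache = {}
first = answer_with_fallback("doc-1", healthy_llm, canned_fallback, cache)
second = answer_with_fallback("doc-1", rate_limited_llm, canned_fallback, cache)
third = answer_with_fallback("doc-2", rate_limited_llm, canned_fallback, cache)
print(first[1], second[1], third[1])
```

Returning the source alongside the text lets you monitor the fallback rate, which is the signal that the primary path is degrading.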

Prompt injection ignored. User input goes into the prompt. Adversarial user crafts payload. Treat as untrusted; validate output; constrain tool use.

Cost blindness. LLM bill goes from $1k to $50k/month and nobody notices for weeks. Set up real-time cost alerts and per-feature attribution.

Building a "platform" for one team. Three engineers spend a year building Kubeflow + Feast + Triton + W&B for one product team. Pick the boring path; invest in platform when you have multiple teams.

Checklist

MLOps production readiness:

  • All models have versioning (code + data + weights)
  • Reproducible training (containerized, seeded, configs versioned)
  • Eval set exists; CI fails if metrics regress on critical slices
  • Shadow or canary deployment for new models
  • Monitoring: latency, accuracy, drift, cost, errors
  • Rollback procedure tested
  • On-call ownership for each model in production
  • Cost budget per model or feature; alerts on overrun
  • LLM apps: prompt injection tests, output validation, PII redaction in logs
  • RAG apps: eval set, retrieval quality measured, hybrid search considered
  • Model cards / documentation for each production model
  • GPU utilization tracked; bin-packing or autoscaling
  • Training data versioned and lineage tracked
  • Feature store or shared featurization to prevent skew
  • Compliance: model registry audit trail, bias testing, data lineage

What's Next

You have an MLOps practice. Connect it to:
