MLOps & AI Infrastructure
MLflow, Kubeflow, Ray, BentoML, vLLM, SageMaker - training, serving, monitoring, and lifecycle for ML and AI workloads
MLOps is the discipline of running machine learning systems in production: training models reliably, serving them with low latency, monitoring drift, rolling them out safely, rolling them back when they break. AI infrastructure is the underlying platform — feature stores, model registries, GPU schedulers, inference servers — that makes it possible.
The category exploded in 2022-2024 as large language models moved from research toys to production. The pre-LLM playbook (train, register, deploy a sklearn model) still applies but is dwarfed by the LLM playbook (RAG, fine-tuning, inference optimization, prompt management, evaluation harnesses).
Why MLOps Is Different from DevOps
| DevOps | MLOps |
|---|---|
| Code in Git | Code + data + models — all versioned |
| Tests pass = ship | Metrics must improve over baseline — and not regress on important subgroups |
| Deterministic output | Probabilistic; same input may produce different output |
| Rollback = redeploy older code | Rollback = redeploy older model, possibly with different data dependencies |
| One artifact: container | Three artifacts: training code, training data, model weights |
| Monitoring: latency, errors | Monitoring: latency, errors, prediction drift, data drift, fairness |
| CI: tests | CI: tests + reproducible training + model validation |
| CD: kubectl apply | CD: shadow → canary → gradual rollout that limits blast radius |
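The "metrics must improve over baseline" row can be sketched as a promotion gate. This is a minimal illustration, not a standard API: the `EvalReport` shape, the `should_promote` name, and both thresholds are hypothetical, and it assumes every metric is higher-is-better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalReport:
    """Evaluation metrics for one model; higher is better (hypothetical shape)."""
    overall: float
    by_subgroup: dict[str, float]


def should_promote(candidate: EvalReport, baseline: EvalReport,
                   min_gain: float = 0.0, max_subgroup_drop: float = 0.01) -> bool:
    """Promote only if the candidate beats the baseline overall and no
    subgroup tracked in the baseline regresses by more than the tolerance."""
    if candidate.overall - baseline.overall <= min_gain:
        return False
    for group, base_score in baseline.by_subgroup.items():
        cand_score = candidate.by_subgroup.get(group)
        if cand_score is None:  # a missing subgroup counts as a failure
            return False
        if base_score - cand_score > max_subgroup_drop:
            return False
    return True


baseline = EvalReport(0.90, {"new_users": 0.85, "returning": 0.92})
better = EvalReport(0.92, {"new_users": 0.86, "returning": 0.93})
regressed = EvalReport(0.93, {"new_users": 0.80, "returning": 0.96})

print(should_promote(better, baseline))     # True
print(should_promote(regressed, baseline))  # False
```

The second candidate has the best overall score and is still rejected, because one subgroup dropped by more than the tolerance. That is the difference from a plain "tests pass" check.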
The data dimension makes everything harder. A "reproducible build" in MLOps means getting the same model from the same data and the same code, yet data is mutable, large, and often distributed.
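One way to make "same data and same code" checkable is to fingerprint each training run. The sketch below is an assumption-laden illustration: `fingerprint_run` is a hypothetical helper, the data is held in memory as bytes for brevity, and real datasets would more likely be hashed in chunks or referenced by a dataset version ID.

```python
import hashlib
import json


def fingerprint_run(code_version: str, data_files: dict[str, bytes], params: dict) -> str:
    """Hash code version, data contents, and hyperparameters into one ID."""
    h = hashlib.sha256()
    h.update(code_version.encode("utf-8"))
    # Sort file names so the result does not depend on dict ordering.
    for name in sorted(data_files):
        h.update(name.encode("utf-8"))
        h.update(hashlib.sha256(data_files[name]).digest())
    # Canonical JSON so key order in `params` does not change the hash.
    h.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    return h.hexdigest()


data = {"train.csv": b"x,y\n1,2\n", "labels.csv": b"id,label\n1,0\n"}
run_a = fingerprint_run("abc123", data, {"lr": 0.01, "depth": 6})
run_b = fingerprint_run("abc123", data, {"depth": 6, "lr": 0.01})
changed = fingerprint_run("abc123", {**data, "train.csv": b"x,y\n1,3\n"},
                          {"lr": 0.01, "depth": 6})

print(run_a == run_b)    # True: same inputs, different key order
print(run_a == changed)  # False: one data value changed
```

A matching fingerprint only says the inputs were identical. It does not cover random seeds, library versions, or hardware nondeterminism, which would need to be recorded separately.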
The Layers of a Modern ML Stack
┌──────────────────────────────────────────────┐
│ Applications: chatbot, recommender, search │ ← what users see
├──────────────────────────────────────────────┤
│ Serving: vLLM, BentoML, Triton, TGI │ ← runs inference
├──────────────────────────────────────────────┤
│ Model registry: MLflow, SageMaker │ ← versioned weights
├──────────────────────────────────────────────┤
│ Training: PyTorch, Ray, Lightning + Kubeflow │ ← produces models
├──────────────────────────────────────────────┤
│ Experiment tracking: MLflow, W&B, Comet │ ← which run was best
├──────────────────────────────────────────────┤
│ Feature store: Feast, Tecton │ ← consistent features
├──────────────────────────────────────────────┤
│ Data: warehouse + lake (Snowflake, S3) │ ← raw and curated data
├──────────────────────────────────────────────┤
│ Compute: GPU nodes, K8s, Ray clusters │ ← horsepower
└──────────────────────────────────────────────┘
Most teams don't build this from scratch; they assemble it, or use a managed stack (SageMaker, Vertex AI, Databricks).
The Players
Experiment tracking and model registry
| Tool | Strengths |
|---|---|
| MLflow | Open source; the default for many; tracking + registry + serving |
| Weights & Biases | Best-in-class UX; experiment tracking; sweep tools |
| Comet | Similar to W&B; strong on collaboration |
| Neptune | Strong for research; cheaper at scale |
Training orchestration
| Tool | Best for |
|---|---|
| Kubeflow | K8s-native training pipelines + serving |
| Ray | Distributed training and serving; Pythonic; from UC Berkeley |
| Flyte | Typed pipelines; cross-cloud; Lyft origin |
| Metaflow | Netflix's; great Python ergonomics |
| PyTorch Lightning | Training-loop abstraction; works on Ray, Kubeflow, etc. |
Model serving and inference
| Tool | Specialty |
|---|---|
| vLLM | Fast LLM inference; continuous batching, paged attention |
| TGI (Text Generation Inference) | HuggingFace's; LLM-specialized |
| BentoML | General model serving; Python framework |
| Triton Inference Server | NVIDIA's; multi-framework, GPU-optimized |
| TorchServe | PyTorch-native |
| TensorFlow Serving | TF-native |
| KServe | Kubernetes-native; multi-framework |
| Ray Serve | Distributed inference; same Ray you train on |
LLM-specific infrastructure
| Tool | Purpose |
|---|---|
| Ollama | Run LLMs locally; great for dev |
| LM Studio | Local LLM with a UI |
| vLLM / TGI / TensorRT-LLM | Production LLM serving |
| LangChain / LlamaIndex | RAG and agent orchestration |
| Vector DBs (Vector Databases) | Embedding storage for RAG |
| LangSmith / Langfuse / Helicone | Tracing and observability for LLM apps |
| Promptfoo / Confident AI / Arize Phoenix | Eval frameworks |
| OpenLLM / RayLLM | Self-hosted LLM platforms |
Feature stores
| Tool | Best for |
|---|---|
| Feast | OSS, lightweight; works with any backend |
| Tecton | Commercial; full-featured |
| Hopsworks | OSS feature store + ML platform |
| Built-in (Databricks, SageMaker) | If you're on that stack already |
Managed end-to-end stacks
| Provider | Notes |
|---|---|
| SageMaker (AWS) | Comprehensive; many sub-products; heavy |
| Vertex AI (GCP) | Pipelines, training, serving in one |
| Databricks | Lakehouse + MLflow + serving + Mosaic AI |
| Azure ML | Microsoft's; AzureML SDK |
How to pick:
- Small team, getting started → MLflow + dedicated serving (BentoML or vLLM)
- Already on a cloud → that cloud's managed stack saves operational pain
- LLMs in production → vLLM for serving + LangSmith/Phoenix for observability + Vector DB for RAG
- Heavy compute, GPU clusters → Ray for training/serving + Kubeflow if K8s-first
- Cross-functional ML platform → Databricks (if you can afford it) or assembled MLflow + Ray + Feast
The MLOps Loop
┌─── observe ───┐
▼ │
data → train → register → deploy → serve → predictions ──→ users
│
▼
monitor drift
│
▼
retrain when needed
Every arrow has tools, processes, and failure modes. The discipline is making each arrow reliable, observable, and reversible.
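As an illustration of the "monitor drift" step, here is a minimal sketch of the Population Stability Index (PSI), one common way to compare a live feature distribution against a training-time reference. The equal-width binning, the `1e-6` floor, and the function name are choices made for this example; the 0.25 alert threshold is a widely quoted rule of thumb, not a guarantee.

```python
import math


def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range live values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps the logarithm finite for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(100)]   # training-time feature values
shifted = [x + 0.5 for x in reference]      # live values after a shift

unchanged_score = psi(reference, reference)
shifted_score = psi(reference, shifted)
needs_retrain = shifted_score > 0.25        # rule-of-thumb threshold (assumption)
```

In practice a check like this would run per feature on a schedule, with the "retrain when needed" arrow triggered by a person or a pipeline once the score stays above the chosen threshold.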
LLM Differences
A traditional ML model (sklearn, XGBoost) is:
- ~MB-scale weights
- Retrained on a schedule (for example weekly), then redeployed
- Inference latency: milliseconds
- Cost: cheap CPU inference
An LLM is:
- GB-to-TB-scale weights (7B params = 14GB fp16)
- Often not trained by you — base model from HuggingFace + fine-tune or RAG
- Inference latency: seconds (output streams at tens of tokens per second)
- Cost: GPU inference dominates (a single H100 inference instance is $1-3/hour)
- Stateful (KV cache); batching matters enormously
- Probabilistic outputs need eval frameworks, not unit tests
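The weight-size and cost figures above follow from simple arithmetic, sketched here. The function names and the throughput numbers are hypothetical; the memory estimate covers weights only and ignores the KV cache and activations.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "fp16") -> float:
    """Memory for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def gpu_cost_per_million_tokens(dollars_per_hour: float, tokens_per_second: float) -> float:
    """Cost of generating one million output tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000


print(weight_memory_gb(7e9, "fp16"))  # 14.0, the 14GB figure quoted above

# Hypothetical throughputs on a $2/hour GPU: one unbatched stream vs. a batched server.
single_stream = gpu_cost_per_million_tokens(2.0, tokens_per_second=50)
batched = gpu_cost_per_million_tokens(2.0, tokens_per_second=1000)
```

With these assumed numbers, the batched case is 20x cheaper per token on the same hardware, which is why batching matters so much for LLM serving cost.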
The MLOps stack stretches around both — but LLM serving and observability are genuinely new categories that didn't exist in 2022.
RAG: The Default LLM Pattern
For most production LLM apps, RAG (Retrieval-Augmented Generation) is the answer:
user query → embed → search vector DB → retrieve top-K docs
│
▼
stuff into LLM prompt → generate answer
The LLM doesn't "know" your data; it reads relevant chunks at inference time. Less fine-tuning, less drift risk, easier to update. Standard stack:
- Embedding model: OpenAI text-embedding-3, BGE, Cohere, your own
- Vector DB: see Vector Databases
- Reranking: Cohere, ColBERT, cross-encoders
- LLM: GPT-4, Claude, open weights via vLLM
- Orchestration: LangChain, LlamaIndex, or hand-rolled
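The retrieve-then-prompt flow can be sketched without any of the tools above. This toy version replaces the embedding model with word counts and the vector DB with a sorted list, so it only shows the shape of the pipeline; every function name here is hypothetical, and the final LLM call is left out.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (stand-in for a vector DB)."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Stuff the retrieved chunks into a prompt for the LLM."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


docs = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Refunds require the original receipt.",
]
prompt = build_prompt("How long do refunds take?", docs)
print(prompt)
```

Swapping `embed` for a real embedding model and `retrieve` for a vector DB query keeps the same structure. Most of the quality work then lands in chunking, retrieval, and reranking rather than in the prompt template.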
For most teams, roughly 80% of "build an AI feature" work is data preparation, retrieval quality, eval, and observability. The model itself is increasingly commoditized.
Learning Path
1. Getting Started
Run MLflow locally; track an experiment; register and serve a model; spin up vLLM for an LLM; build a tiny RAG app
2. Patterns
Training pipelines, model versioning, feature stores, A/B testing, canary models, shadow mode, drift detection, RAG patterns
3. Best Practices
Reproducibility, evaluation, monitoring, GPU cost, LLM safety, common pitfalls, when to fine-tune vs RAG vs prompt
When You Don't Need MLOps Infra Yet
Honest cases:
- Your "AI" is a single OpenAI API call from your app. You don't need MLflow yet. Add observability (LangSmith) when you have real traffic.
- You haven't shipped a single ML feature. Pick one use case; ship it ugly; learn what hurts; then invest in platform.
- You're prototyping in a notebook. Notebook + W&B for tracking is fine for a long time.
Build MLOps infra in response to felt pain, not anticipated needs. The category has more vendors than most teams need.
MLOps is data engineering + DevOps + statistics, all at once. Teams that succeed treat it as a platform discipline, not "the data scientist's responsibility." The data scientist trains a model; the platform makes it deployable, observable, recoverable, and economical. Don't ask data scientists to be Kubernetes experts; build the paved path so they can focus on the model. And build the LLM stack with eyes open: it changes weekly, and the right tool today may not be the right tool in six months.