MLflow, Kubeflow, Ray, BentoML, vLLM, SageMaker - training, serving, monitoring, and lifecycle for ML and AI workloads

MLOps & AI Infrastructure

MLOps is the discipline of running machine learning systems in production: training models reliably, serving them with low latency, monitoring drift, rolling them out safely, and rolling them back when they break. AI infrastructure is the underlying platform (feature stores, model registries, GPU schedulers, inference servers) that makes it possible.

The category exploded in 2022-2024 as large language models moved from research toys to production. The pre-LLM playbook (train, register, deploy a sklearn model) still applies but is dwarfed by the LLM playbook (RAG, fine-tuning, inference optimization, prompt management, evaluation harnesses).

Why MLOps Is Different from DevOps

DevOps	MLOps
Code in Git	Code + data + models — all versioned
Tests pass = ship	Metrics must improve over baseline — and not regress on important subgroups
Deterministic output	Probabilistic; same input may produce different output
Rollback = redeploy older code	Rollback = redeploy older model, possibly with different data dependencies
One artifact: container	Three artifacts: training code, training data, model weights
Monitoring: latency, errors	Monitoring: latency, errors, prediction drift, data drift, fairness
CI: tests	CI: tests + reproducible training + model validation
CD: kubectl apply	CD: shadow → canary → staged rollout that limits blast radius
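
The "metrics must improve over baseline" row can be sketched as a promotion gate that runs in CI after training. This is a minimal illustration, not a standard API: the function name, the "overall" key, and both thresholds are assumptions, and it assumes a metric where higher is better.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    promote: bool
    reasons: list[str]


def promotion_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    min_gain: float = 0.0,            # hypothetical threshold
    max_subgroup_drop: float = 0.01,  # hypothetical threshold
) -> GateResult:
    """Compare candidate metrics to the baseline (higher is better).

    Both dicts map a slice name to a metric value. The key "overall" is the
    headline number; every other key is treated as a subgroup.
    """
    reasons = []
    if candidate["overall"] - baseline["overall"] <= min_gain:
        reasons.append("overall metric did not improve over baseline")
    for name, base_value in baseline.items():
        if name == "overall":
            continue
        # A subgroup missing from the candidate counts as a full regression.
        drop = base_value - candidate.get(name, 0.0)
        if drop > max_subgroup_drop:
            reasons.append(f"subgroup {name!r} regressed by {drop:.3f}")
    return GateResult(promote=not reasons, reasons=reasons)
```

A candidate that improves the headline metric but loses five points on one subgroup is rejected, with the reason recorded. That is the behaviour a "tests pass = ship" pipeline cannot express.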

The data dimension makes everything harder. "Reproducible build" in MLOps means same model from same data and same code — and data is mutable, large, and often distributed.
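
One way to make "same model from same data and same code" checkable is to record a fingerprint of all three inputs with every training run. The sketch below is an assumption about how that could look, not a feature of any particular tool. The helper names are hypothetical, and the code version is expected to be something like a Git commit hash supplied by the caller.

```python
import hashlib
import json


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large datasets fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def run_fingerprint(code_version: str, data_digests: dict[str, str], params: dict) -> str:
    """Combine code version, data digests, and hyperparameters into one ID.

    sort_keys=True makes the result independent of dict ordering, so two
    runs with identical inputs always produce the same fingerprint.
    """
    payload = json.dumps(
        {"code": code_version, "data": data_digests, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Storing the fingerprint next to the model weights lets a later run answer "did anything change?" without re-reading the data. It does not make training itself deterministic; random seeds and GPU nondeterminism still need separate handling.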

The Layers of a Modern ML Stack

┌──────────────────────────────────────────────┐
│ Applications: chatbot, recommender, search   │ ← what users see
├──────────────────────────────────────────────┤
│ Serving: vLLM, BentoML, Triton, TGI          │ ← runs inference
├──────────────────────────────────────────────┤
│ Model registry: MLflow, SageMaker            │ ← versioned weights
├──────────────────────────────────────────────┤
│ Training: PyTorch, Ray, Lightning + Kubeflow │ ← produces models
├──────────────────────────────────────────────┤
│ Experiment tracking: MLflow, W&B, Comet      │ ← which run was best
├──────────────────────────────────────────────┤
│ Feature store: Feast, Tecton                 │ ← consistent features
├──────────────────────────────────────────────┤
│ Data: warehouse + lake (Snowflake, S3)       │ ← raw and curated data
├──────────────────────────────────────────────┤
│ Compute: GPU nodes, K8s, Ray clusters        │ ← horsepower
└──────────────────────────────────────────────┘

Most teams don't build this from scratch — they assemble it, or use a managed stack (SageMaker, Vertex AI, Databricks).

The Players

Experiment tracking and model registry

Tool	Strengths
MLflow	Open source; the default for many; tracking + registry + serving
Weights & Biases	Best-in-class UX; experiment tracking; sweep tools
Comet	Similar to W&B; strong on collaboration
Neptune	Strong for research; cheaper at scale

Training orchestration

Tool	Best for
Kubeflow	K8s-native training pipelines + serving
Ray	Distributed training and serving; Pythonic; from UC Berkeley
Flyte	Typed pipelines; cross-cloud; Lyft origin
Metaflow	Netflix's; great Python ergonomics
PyTorch Lightning	Training-loop abstraction; works on Ray, Kubeflow, etc.

Model serving and inference

Tool	Specialty
vLLM	Fast LLM inference; continuous batching, paged attention
TGI (Text Generation Inference)	HuggingFace's; LLM-specialized
BentoML	General model serving; Python framework
Triton Inference Server	NVIDIA's; multi-framework, GPU-optimized
TorchServe	PyTorch-native
TensorFlow Serving	TF-native
KServe	Kubernetes-native; multi-framework
Ray Serve	Distributed inference; same Ray you train on

LLM-specific infrastructure

Tool	Purpose
Ollama	Run LLMs locally; great for dev
LM Studio	Local LLM with a UI
vLLM / TGI / TensorRT-LLM	Production LLM serving
LangChain / LlamaIndex	RAG and agent orchestration
Vector DBs (Vector Databases)	Embedding storage for RAG
LangSmith / Langfuse / Helicone	Tracing and observability for LLM apps
Promptfoo / Confident AI / Arize Phoenix	Eval frameworks
OpenLLM / RayLLM	Self-hosted LLM platforms

Feature stores

Tool	Best for
Feast	OSS, lightweight; works with any backend
Tecton	Commercial; full-featured
Hopsworks	OSS feature store + ML platform
Built-in (Databricks, SageMaker)	If you're on that stack already

Managed end-to-end stacks

Provider	Notes
SageMaker (AWS)	Comprehensive; many sub-products; heavy
Vertex AI (GCP)	Pipelines, training, serving in one
Databricks	Lakehouse + MLflow + serving + Mosaic AI
Azure ML	Microsoft's; AzureML SDK

How to pick:

Small team, getting started → MLflow + dedicated serving (BentoML or vLLM)
Already on a cloud → that cloud's managed stack saves operational pain
LLMs in production → vLLM for serving + LangSmith/Phoenix for observability + Vector DB for RAG
Heavy compute, GPU clusters → Ray for training/serving + Kubeflow if K8s-first
Cross-functional ML platform → Databricks (if you can afford it) or assembled MLflow + Ray + Feast

The MLOps Loop

                   ┌─── observe ───┐
                   ▼                │
data → train → register → deploy → serve → predictions ──→ users
                                            │
                                            ▼
                                       monitor drift
                                            │
                                            ▼
                                       retrain when needed

Every arrow has tools, processes, and failure modes. The discipline is making each arrow reliable, observable, and reversible.
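
As an illustration of the "monitor drift" arrow, the sketch below computes a Population Stability Index (PSI) between a reference sample of one feature and a live sample. PSI is one common drift statistic among several; the binning scheme, the floor for empty bins, and the function name are choices made for this example.

```python
import math


def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between a reference and a live sample.

    Bins are equal-width over the range of the reference sample.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference

    def fractions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range live values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but those cut-offs are conventions rather than guarantees and should be tuned per feature.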

LLM Differences

A traditional ML model (sklearn, XGBoost) is:

~MB-scale weights
Retrained on a schedule (for example, weekly), then deployed
Inference latency: milliseconds
Cost: cheap CPU inference

An LLM is:

GB-to-TB-scale weights (7B params × 2 bytes ≈ 14 GB in fp16)
Often not trained by you — base model from HuggingFace + fine-tune or RAG
Inference latency: seconds per response (tens of output tokens per second per request)
Cost: GPU inference dominates (a single H100 inference instance is roughly $1-3/hour, depending on provider and commitment)
Stateful (KV cache); batching matters enormously
Probabilistic outputs need eval frameworks, not unit tests
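
The sizing figures above follow from simple arithmetic, sketched here. The weight formula reproduces the 14 GB figure. The KV-cache formula is the usual back-of-envelope estimate for a decoder-only transformer, and the model shape used in the note below (32 layers, 32 KV heads, head dimension 128) is an illustrative assumption for a 7B-class model, not a claim about a specific checkpoint.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone; fp16/bf16 use 2 bytes per parameter."""
    return n_params * bytes_per_param / 1e9


def kv_cache_gb(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_value: int = 2,
) -> float:
    """Approximate KV-cache size for a batch of sequences.

    The leading 2 accounts for one key tensor and one value tensor per layer.
    """
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return values * bytes_per_value / 1e9
```

With the assumed shape, `kv_cache_gb(32, 32, 128, seq_len=4096, batch_size=1)` is about 2.1 GB, so a batch of 16 such sequences needs more memory for cache than for the 14 GB of weights. That is why cache management and batching strategy dominate LLM serving design.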

The MLOps stack stretches around both — but LLM serving and observability are genuinely new categories that didn't exist in 2022.
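
To see why batching matters so much for cost, it helps to convert GPU price and throughput into cost per million output tokens. The function below is plain arithmetic; the throughput numbers in the note that follows are assumed for illustration and are not benchmarks.

```python
def cost_per_million_tokens(gpu_dollars_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per one million generated tokens on a fully utilized GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000
```

At an assumed $2/hour, a GPU serving one request at 30 tokens/s costs about $18.50 per million tokens. If batching raises aggregate throughput to an assumed 1,500 tokens/s on the same GPU, the cost drops to about $0.37. The hardware bill is identical; only utilization changed.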

RAG: The Default LLM Pattern

For most production LLM apps, RAG (Retrieval-Augmented Generation) is the answer:

user query → embed → search vector DB → retrieve top-K docs
                                              │
                                              ▼
                          stuff into LLM prompt → generate answer

The LLM doesn't "know" your data; it reads relevant chunks at inference time. Less fine-tuning, less drift risk, easier to update. Standard stack:

Embedding model: OpenAI text-embedding-3 (small or large), BGE, Cohere, your own
Vector DB: see Vector Databases
Reranking: Cohere, ColBERT, cross-encoders
LLM: GPT-4, Claude, open weights via vLLM
Orchestration: LangChain, LlamaIndex, or hand-rolled
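
The retrieval flow above can be sketched end to end with the standard library. This is a toy: `embed` is a bag-of-words stand-in for a real embedding model, the "vector DB" is a Python list, and the prompt template is invented for the example. A production system would swap each piece for a component from the stack listed above.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Stuff the retrieved chunks into a prompt for the LLM."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

The string returned by `build_prompt` is what gets sent to the LLM. Updating the system's knowledge means changing `docs`, not retraining anything.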

For most teams, roughly 80% of "build an AI feature" work is data preparation, retrieval quality, eval, and observability. The model itself is increasingly commoditized.

Learning Path

1. Getting Started

Run MLflow locally; track an experiment; register and serve a model; spin up vLLM for an LLM; build a tiny RAG app

2. Patterns

Training pipelines, model versioning, feature stores, A/B testing, canary models, shadow mode, drift detection, RAG patterns

3. Best Practices

Reproducibility, evaluation, monitoring, GPU cost, LLM safety, common pitfalls, when to fine-tune vs RAG vs prompt

When You Don't Need MLOps Infra Yet

Honest cases:

Your "AI" is a single OpenAI API call from your app. You don't need MLflow yet. Add observability (LangSmith) when you have real traffic.
You haven't shipped a single ML feature. Pick one use case; ship it ugly; learn what hurts; then invest in platform.
You're prototyping in a notebook. Notebook + W&B for tracking is fine for a long time.

Build MLOps infra in response to felt pain, not anticipated needs. The category has more vendors than most teams need.

MLOps is data engineering + DevOps + statistics, all at once. Teams that succeed treat it as a platform discipline, not "the data scientist's responsibility." The data scientist trains a model; the platform makes it deployable, observable, recoverable, and economical. Don't ask data scientists to be Kubernetes experts; build the paved path so they can focus on the model. And build the LLM stack with eyes open: it changes weekly, and the right tool today may not be the right tool in six months.
