Steven's Knowledge

MLOps & AI Infrastructure

MLflow, Kubeflow, Ray, BentoML, vLLM, SageMaker - training, serving, monitoring, and lifecycle for ML and AI workloads

MLOps is the discipline of running machine learning systems in production: training models reliably, serving them with low latency, monitoring drift, rolling them out safely, and rolling them back when they break. AI infrastructure is the underlying platform (feature stores, model registries, GPU schedulers, inference servers) that makes this possible.

The category exploded in 2022-2024 as large language models moved from research toys to production. The pre-LLM playbook (train, register, deploy a sklearn model) still applies but is dwarfed by the LLM playbook (RAG, fine-tuning, inference optimization, prompt management, evaluation harnesses).

Why MLOps Is Different from DevOps

DevOps | MLOps
Code in Git | Code + data + models, all versioned
Tests pass = ship | Metrics must improve over baseline, and not regress on important subgroups
Deterministic output | Probabilistic; same input may produce different output
Rollback = redeploy older code | Rollback = redeploy older model, possibly with different data dependencies
One artifact: container | Three artifacts: training code, training data, model weights
Monitoring: latency, errors | Monitoring: latency, errors, prediction drift, data drift, fairness
CI: tests | CI: tests + reproducible training + model validation
CD: kubectl apply | CD: shadow → canary → blast-radius rollout
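
The "metrics must improve over baseline" row can be made concrete as a promotion gate. The sketch below is a minimal illustration, not a standard API: the function name `should_promote`, the metric layout, and both thresholds are assumptions chosen for the example, and it assumes higher metric values are better.

```python
from typing import Dict


def should_promote(
    candidate: Dict[str, float],
    baseline: Dict[str, float],
    min_overall_gain: float = 0.0,
    max_subgroup_drop: float = 0.01,
) -> bool:
    """Return True only if the candidate beats the baseline overall
    and no subgroup regresses by more than max_subgroup_drop.

    Both dicts map a slice name to a metric where higher is better.
    The key "overall" is required; every other key is a subgroup.
    """
    if candidate["overall"] - baseline["overall"] <= min_overall_gain:
        return False  # no real improvement over the baseline
    for name, base_score in baseline.items():
        if name == "overall":
            continue
        # A subgroup missing from the candidate report counts as a failure.
        if name not in candidate:
            return False
        if base_score - candidate[name] > max_subgroup_drop:
            return False  # regression on an important subgroup
    return True


# Hypothetical evaluation reports, for illustration only.
baseline = {"overall": 0.90, "new_users": 0.85, "mobile": 0.88}
better_everywhere = {"overall": 0.92, "new_users": 0.86, "mobile": 0.89}
hurts_new_users = {"overall": 0.93, "new_users": 0.80, "mobile": 0.90}

print(should_promote(better_everywhere, baseline))  # True
print(should_promote(hurts_new_users, baseline))    # False
```

The second candidate has the best overall score but is still rejected, which is the difference from a plain "tests pass" check.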

The data dimension makes everything harder. A "reproducible build" in MLOps means getting the same model from the same data and the same code, and data is mutable, large, and often distributed.
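
One way to make "same data and same code" checkable is to record a fingerprint of all three inputs with every training run. The following is a minimal sketch using only the standard library; the helper names `data_digest` and `training_fingerprint`, the commit string, and the toy rows are assumptions for illustration, not part of any particular tool.

```python
import hashlib
import json
from typing import Any, Dict, Iterable


def data_digest(chunks: Iterable[bytes]) -> str:
    """Hash dataset content chunk by chunk so large data never sits in memory."""
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()


def training_fingerprint(code_version: str, data_sha256: str, params: Dict[str, Any]) -> str:
    """Combine code version, data digest, and hyperparameters into one ID."""
    payload = json.dumps(
        {"code": code_version, "data": data_sha256, "params": params},
        sort_keys=True,  # stable key order so equal inputs hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical inputs: a commit ID and two versions of a tiny dataset.
rows_v1 = [b"user,label\n", b"1,0\n", b"2,1\n"]
rows_v2 = [b"user,label\n", b"1,0\n", b"2,0\n"]  # one label changed upstream

fp_a = training_fingerprint("abc123", data_digest(rows_v1), {"lr": 0.01, "seed": 42})
fp_b = training_fingerprint("abc123", data_digest(rows_v1), {"seed": 42, "lr": 0.01})
fp_c = training_fingerprint("abc123", data_digest(rows_v2), {"lr": 0.01, "seed": 42})

print(fp_a == fp_b)  # True: same code, data, and params
print(fp_a == fp_c)  # False: the data changed
```

A matching fingerprint only says the inputs were identical. Nondeterministic GPU kernels or unpinned library versions can still produce different weights, so treat this as a necessary check rather than a guarantee.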

The Layers of a Modern ML Stack

┌──────────────────────────────────────────────┐
│ Applications: chatbot, recommender, search   │ ← what users see
├──────────────────────────────────────────────┤
│ Serving: vLLM, BentoML, Triton, TGI          │ ← runs inference
├──────────────────────────────────────────────┤
│ Model registry: MLflow, SageMaker            │ ← versioned weights
├──────────────────────────────────────────────┤
│ Training: PyTorch, Ray, Lightning + Kubeflow │ ← produces models
├──────────────────────────────────────────────┤
│ Experiment tracking: MLflow, W&B, Comet      │ ← which run was best
├──────────────────────────────────────────────┤
│ Feature store: Feast, Tecton                 │ ← consistent features
├──────────────────────────────────────────────┤
│ Data: warehouse + lake (Snowflake, S3)       │ ← raw and curated data
├──────────────────────────────────────────────┤
│ Compute: GPU nodes, K8s, Ray clusters        │ ← horsepower
└──────────────────────────────────────────────┘

Most teams don't build this from scratch — they assemble it, or use a managed stack (SageMaker, Vertex AI, Databricks).

The Players

Experiment tracking and model registry

Tool | Strengths
MLflow | Open source; the default for many; tracking + registry + serving
Weights & Biases | Best-in-class UX; experiment tracking; sweep tools
Comet | Similar to W&B; strong on collaboration
Neptune | Strong for research; cheaper at scale

Training orchestration

Tool | Best for
Kubeflow | K8s-native training pipelines + serving
Ray | Distributed training and serving; Pythonic; from UC Berkeley
Flyte | Typed pipelines; cross-cloud; Lyft origin
Metaflow | Netflix's; great Python ergonomics
PyTorch Lightning | Training-loop abstraction; works on Ray, Kubeflow, etc.

Model serving and inference

Tool | Specialty
vLLM | Fast LLM inference; continuous batching, paged attention
TGI (Text Generation Inference) | HuggingFace's; LLM-specialized
BentoML | General model serving; Python framework
Triton Inference Server | NVIDIA's; multi-framework, GPU-optimized
TorchServe | PyTorch-native
TensorFlow Serving | TF-native
KServe | Kubernetes-native; multi-framework
Ray Serve | Distributed inference; same Ray you train on

LLM-specific infrastructure

Tool | Purpose
Ollama | Run LLMs locally; great for dev
LM Studio | Local LLM with a UI
vLLM / TGI / TensorRT-LLM | Production LLM serving
LangChain / LlamaIndex | RAG and agent orchestration
Vector DBs (see Vector Databases) | Embedding storage for RAG
LangSmith / Langfuse / Helicone | Tracing and observability for LLM apps
Promptfoo / Confident AI / Arize Phoenix | Eval frameworks
OpenLLM / RayLLM | Self-hosted LLM platforms

Feature stores

Tool | Best for
Feast | OSS, lightweight; works with any backend
Tecton | Commercial; full-featured
Hopsworks | OSS feature store + ML platform
Built-in (Databricks, SageMaker) | If you're on that stack already

Managed end-to-end stacks

Provider | Notes
SageMaker (AWS) | Comprehensive; many sub-products; heavy
Vertex AI (GCP) | Pipelines, training, serving in one
Databricks | Lakehouse + MLflow + serving + Mosaic AI
Azure ML | Microsoft's; AzureML SDK

How to pick:

  • Small team, getting started → MLflow + dedicated serving (BentoML or vLLM)
  • Already on a cloud → that cloud's managed stack saves operational pain
  • LLMs in production → vLLM for serving + LangSmith/Phoenix for observability + Vector DB for RAG
  • Heavy compute, GPU clusters → Ray for training/serving + Kubeflow if K8s-first
  • Cross-functional ML platform → Databricks (if you can afford it) or assembled MLflow + Ray + Feast

The MLOps Loop

                   ┌─── observe ───┐
                   ▼               │
data → train → register → deploy → serve → predictions ──→ users
                                     │
                                     ▼
                               monitor drift
                                     │
                                     ▼
                            retrain when needed

Every arrow has tools, processes, and failure modes. The discipline is making each arrow reliable, observable, and reversible.
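
The "monitor drift → retrain when needed" step can be sketched with a simple distribution comparison. The example below uses the Population Stability Index (PSI) as one possible drift score; that choice is an assumption, since the text above does not name a metric. The function names, the equal-width binning, and the 0.2 threshold (a commonly cited rule of thumb) are likewise illustrative.

```python
import math
from typing import List, Sequence


def _bin_fractions(values: Sequence[float], lo: float, width: float,
                   bins: int, eps: float) -> List[float]:
    counts = [0] * bins
    for x in values:
        idx = int((x - lo) / width)
        idx = max(0, min(idx, bins - 1))  # clamp out-of-range values into edge bins
        counts[idx] += 1
    n = len(values)
    # eps keeps empty bins from producing log(0) or division by zero
    return [max(c / n, eps) for c in counts]


def population_stability_index(reference: Sequence[float], current: Sequence[float],
                               bins: int = 10, eps: float = 1e-6) -> float:
    """PSI = sum over bins of (cur - ref) * ln(cur / ref).

    Bin edges are equal-width over the range of the reference sample.
    """
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread; cannot build bins")
    width = (hi - lo) / bins
    ref = _bin_fractions(reference, lo, width, bins, eps)
    cur = _bin_fractions(current, lo, width, bins, eps)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def needs_retrain(reference: Sequence[float], current: Sequence[float],
                  threshold: float = 0.2) -> bool:
    return population_stability_index(reference, current) > threshold


# Synthetic feature values standing in for training-time and live data.
training_values = [i / 100 for i in range(100)]
live_same = list(training_values)
live_shifted = [x + 0.5 for x in training_values]

print(needs_retrain(training_values, live_same))     # False
print(needs_retrain(training_values, live_shifted))  # True
```

In practice the same check would run per feature and on the prediction distribution, with thresholds tuned to how noisy each signal is.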

LLM Differences

A traditional ML model (sklearn, XGBoost) is:

  • ~MB-scale weights
  • Retrained on a schedule (for example weekly), then deployed
  • Inference latency: milliseconds
  • Cost: cheap CPU inference

An LLM is:

  • GB-to-TB-scale weights (7B params = 14GB fp16)
  • Often not trained by you — base model from HuggingFace + fine-tune or RAG
  • Inference latency: seconds (tens of output tokens per second)
  • Cost: GPU inference dominates (a single H100 inference instance is $1-3/hour)
  • Stateful (KV cache); batching matters enormously
  • Probabilistic outputs need eval frameworks, not unit tests
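
The "7B params = 14GB fp16" figure and the KV-cache point follow from simple arithmetic, sketched below. The helper names are hypothetical, and the model shape used for the KV cache (32 layers, 32 KV heads, head dimension 128) is an assumed 7B-class configuration for illustration; real models differ, especially those using grouped-query attention.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "fp16") -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype: str = "fp16") -> int:
    """KV cache held per token of context, across all layers."""
    # 2 = one key tensor plus one value tensor per layer
    return 2 * num_layers * num_kv_heads * head_dim * BYTES_PER_PARAM[dtype]


print(weight_memory_gb(7e9, "fp16"))  # 14.0

per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token)  # 524288 bytes, i.e. 0.5 MiB per token

# Eight concurrent requests, each holding 4096 tokens of context.
kv_total_gb = per_token * 4096 * 8 / 1e9
print(round(kv_total_gb, 1))  # 17.2
```

Under these assumptions the KV cache for a modest batch is larger than the weights themselves, which is why cache management and batching dominate LLM serving design.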

The MLOps stack stretches around both, but LLM serving and observability are genuinely new categories that barely existed before 2022.
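
To see why batching matters for the GPU cost line in the list above, it helps to convert an hourly GPU price into a cost per million output tokens. This is a back-of-the-envelope sketch; the function name and the throughput numbers are assumptions for illustration, not measurements.

```python
def cost_per_million_tokens(gpu_dollars_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per one million generated tokens at a given aggregate throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


# Assumed: $2/hour GPU, one request at a time producing 30 tokens/s.
single_stream = cost_per_million_tokens(2.0, 30)

# Assumed: same GPU serving 32 concurrent requests at 25 tokens/s each.
batched = cost_per_million_tokens(2.0, 32 * 25)

print(round(single_stream, 2))  # 18.52
print(round(batched, 2))        # 0.69
```

Each request gets slightly slower in the batched case, but the cost per token drops by more than an order of magnitude under these assumed numbers.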

RAG: The Default LLM Pattern

For most production LLM apps, RAG (Retrieval-Augmented Generation) is the answer:

user query → embed → search vector DB → retrieve top-K docs
                                          │
                                          ▼
                          stuff into LLM prompt → generate answer

The LLM doesn't "know" your data; it reads relevant chunks at inference time. Less fine-tuning, less drift risk, easier to update. Standard stack:

  • Embedding model: OpenAI text-embedding-3, BGE, Cohere, your own
  • Vector DB: see Vector Databases
  • Reranking: Cohere, ColBERT, cross-encoders
  • LLM: GPT-4, Claude, open weights via vLLM
  • Orchestration: LangChain, LlamaIndex, or hand-rolled
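
The retrieve-then-stuff flow above can be sketched end to end with the standard library. Everything here is a stand-in: the bag-of-words `embed` function replaces a real embedding model, `ToyVectorStore` replaces a vector database, and the final LLM call is omitted, so the example only shows the shape of the pipeline, not production retrieval quality.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system uses a trained model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector DB: brute-force cosine search."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, Counter]] = []

    def add(self, doc: str) -> None:
        self._items.append((doc, embed(doc)))

    def search(self, query: str, k: int = 2) -> List[str]:
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [doc for doc, _ in ranked[:k]]


def build_prompt(query: str, docs: List[str]) -> str:
    """Stuff the retrieved chunks into the prompt ahead of the question."""
    context = "\n".join(f"- {d}" for d in docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


store = ToyVectorStore()
store.add("Rollback means redeploying the previous model version from the registry.")
store.add("The feature store serves consistent features for training and serving.")
store.add("GPU nodes are scheduled by Kubernetes.")

query = "How do I rollback a model?"
top_docs = store.search(query, k=1)
print(top_docs[0])  # Rollback means redeploying the previous model version from the registry.

prompt = build_prompt(query, top_docs)  # this string would be sent to the LLM
```

Swapping in a real embedding model, a vector database, and a reranker changes the quality of `top_docs` but not the structure of the flow.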

For most teams, roughly 80% of "build an AI feature" work is data preparation, retrieval quality, eval, and observability. The model itself is increasingly commoditized.

When You Don't Need MLOps Infra Yet

Honest cases:

  • Your "AI" is a single OpenAI API call from your app. You don't need MLflow yet. Add observability (LangSmith) when you have real traffic.
  • You haven't shipped a single ML feature. Pick one use case; ship it ugly; learn what hurts; then invest in platform.
  • You're prototyping in a notebook. Notebook + W&B for tracking is fine for a long time.
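
For the single-API-call case, a thin logging wrapper is often enough visibility before adopting a tracing product. The sketch below is one hypothetical way to do that with the standard library; `with_call_log` and `fake_llm` are illustrative names, and `fake_llm` stands in for whatever provider client the app actually calls.

```python
import json
import time
from typing import Callable, Dict, List


def with_call_log(llm_call: Callable[[str], str], sink: List[Dict]) -> Callable[[str], str]:
    """Wrap an LLM call so every invocation appends one record to sink."""

    def wrapped(prompt: str) -> str:
        start = time.perf_counter()
        status = "error"
        response = ""
        try:
            response = llm_call(prompt)
            status = "ok"
            return response
        finally:
            # Runs on success and on exceptions, so failures are logged too.
            sink.append({
                "status": status,
                "latency_ms": round((time.perf_counter() - start) * 1000, 2),
                "prompt_chars": len(prompt),
                "response_chars": len(response),
            })

    return wrapped


def fake_llm(prompt: str) -> str:
    # Stand-in for a real provider call; replace with your client.
    return "stub answer to: " + prompt


records: List[Dict] = []
ask = with_call_log(fake_llm, records)
answer = ask("What is MLOps?")
print(json.dumps(records[0]))
```

Once records like these show real traffic and real failure patterns, moving to a dedicated observability tool becomes a response to felt pain rather than a guess.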

Build MLOps infra in response to felt pain, not anticipated needs. The category has more vendors than most teams need.

MLOps is data engineering + DevOps + statistics, all at once. Teams that succeed treat it as a platform discipline, not "the data scientist's responsibility." The data scientist trains a model; the platform makes it deployable, observable, recoverable, and economical. Don't ask data scientists to be Kubernetes experts; build the paved path so they can focus on the model. And build the LLM stack with eyes open: it changes weekly, and the right tool today may not be the right tool in six months.
