Getting Started
Run MLflow locally, track an experiment, register and serve a model, spin up vLLM for an LLM, build a tiny RAG app
This page walks the modern MLOps loop in two flavors: a classical model (MLflow tracking + serving) and an LLM app (vLLM + a tiny RAG).
Path A: MLflow Locally
pip install mlflow scikit-learn pandas
mlflow ui  # in another terminal: http://localhost:5000

Train and Track
# train.py
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("iris-classifier")
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
for n in [10, 50, 100, 200]:
    with mlflow.start_run():
        mlflow.log_param("n_estimators", n)
        model = RandomForestClassifier(n_estimators=n, random_state=42)
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        mlflow.log_metric("accuracy", accuracy_score(y_test, preds))
        mlflow.log_metric("f1", f1_score(y_test, preds, average='weighted'))
        mlflow.sklearn.log_model(model, "model", registered_model_name="iris-classifier")

python train.py

Open http://localhost:5000. You see:
- Four runs with different n_estimators
- Metrics and parameters
- Artifacts (the saved model)
- The Models tab shows iris-classifier with versions 1-4
This is the basic loop: every experiment is tracked; every model is versioned; you can answer "which run gave us the best F1, with which params, what was the data?"
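As a rough sketch of answering the "which run was best" question in code, the snippet below ranks run records by F1 in plain Python. The `runs` list, its IDs, and its metric values are made-up placeholders standing in for what the tracking server stores, and `best_run` is a hypothetical helper. Against a real server, `mlflow.search_runs` with an `order_by` on the metric is the usual route.

```python
# Illustrative run records; IDs and metric values are placeholders, not real results.
runs = [
    {"run_id": "a1", "params": {"n_estimators": 10}, "metrics": {"accuracy": 0.92, "f1": 0.91}},
    {"run_id": "b2", "params": {"n_estimators": 50}, "metrics": {"accuracy": 0.95, "f1": 0.95}},
    {"run_id": "c3", "params": {"n_estimators": 100}, "metrics": {"accuracy": 0.95, "f1": 0.95}},
    {"run_id": "d4", "params": {"n_estimators": 200}, "metrics": {"accuracy": 0.94, "f1": 0.94}},
]


def best_run(runs, metric="f1"):
    """Return the run with the highest metric; ties go to the smaller forest."""
    return max(
        runs,
        key=lambda r: (r["metrics"][metric], -r["params"]["n_estimators"]),
    )


winner = best_run(runs)
print(winner["run_id"], winner["params"], winner["metrics"]["f1"])
```

Breaking ties toward fewer trees is one possible policy: when two runs score the same, the cheaper model is the safer pick.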
Promote a Version
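import mlflow

# Point the client at the same tracking server used for training
mlflow.set_tracking_uri("http://localhost:5000")
client = mlflow.MlflowClient()

# Promote v2 to the "Production" stage and archive whatever held it before.
# Note: stages are deprecated since MLflow 2.9 in favor of aliases.
client.transition_model_version_stage(
    name="iris-classifier",
    version=2,
    stage="Production",
    archive_existing_versions=True,
)

Now models:/iris-classifier/Production always points to v2. The serving layer refers to the stage, not to a version number, so switching versions means reassigning the stage and reloading the server rather than editing the deployment.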
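To make the indirection concrete, here is a toy in-memory registry, not MLflow's implementation. `TinyRegistry` and its methods are hypothetical names. It only illustrates how a `models:/<name>/<stage>` URI resolves to whichever version currently holds the stage, and how archiving the previous holder keeps that resolution unambiguous.

```python
class TinyRegistry:
    """Toy registry: tracks which stage each (model name, version) is in."""

    def __init__(self):
        self._stages = {}  # {(name, version): stage}

    def register(self, name, version):
        self._stages[(name, version)] = "None"

    def transition(self, name, version, stage, archive_existing=True):
        if (name, version) not in self._stages:
            raise KeyError(f"{name} v{version} is not registered")
        if archive_existing:
            # Demote any version of this model that currently holds the stage.
            for (n, v), s in list(self._stages.items()):
                if n == name and s == stage:
                    self._stages[(n, v)] = "Archived"
        self._stages[(name, version)] = stage

    def stage_of(self, name, version):
        return self._stages[(name, version)]

    def resolve(self, uri):
        """Resolve 'models:/<name>/<stage>' to a concrete version number."""
        prefix = "models:/"
        if not uri.startswith(prefix):
            raise ValueError(f"unsupported URI: {uri}")
        name, stage = uri[len(prefix):].split("/", 1)
        versions = [v for (n, v), s in self._stages.items() if n == name and s == stage]
        if not versions:
            raise LookupError(f"no version of {name} in stage {stage}")
        return max(versions)


registry = TinyRegistry()
for version in (1, 2, 3, 4):
    registry.register("iris-classifier", version)

registry.transition("iris-classifier", 2, "Production")
first = registry.resolve("models:/iris-classifier/Production")

registry.transition("iris-classifier", 4, "Production")
second = registry.resolve("models:/iris-classifier/Production")
print(first, second, registry.stage_of("iris-classifier", 2))
```

The URI passed to `resolve` never changes between the two calls; only the registry's internal pointer does.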
Serve It
export MLFLOW_TRACKING_URI=http://localhost:5000  # so models:/ URIs resolve against the server
# --env-manager local replaces the --no-conda flag from MLflow 1.x
mlflow models serve --model-uri models:/iris-classifier/Production --port 5001 --env-manager local

curl -X POST http://localhost:5001/invocations \
  -H "Content-Type: application/json" \
  -d '{"inputs": [[5.1, 3.5, 1.4, 0.2]]}'
# {"predictions": [0]}

Real production serving has more layers (BentoML, Triton, SageMaker, etc.), but MLflow's built-in serve is enough for a starter.
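The same request can be sent from Python. This is a minimal sketch using only the standard library; `build_request` and `predict` are hypothetical helper names, and the URL assumes the local server from the command above is listening on port 5001.

```python
import json
import urllib.request


def build_request(url, rows):
    """Build a POST request carrying the same JSON body as the curl call."""
    body = json.dumps({"inputs": rows}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(url, rows, timeout=10):
    """Send rows to the scoring server and return its 'predictions' list."""
    with urllib.request.urlopen(build_request(url, rows), timeout=timeout) as resp:
        return json.loads(resp.read())["predictions"]


# Building the request does not touch the network, so it is safe to inspect.
req = build_request("http://localhost:5001/invocations", [[5.1, 3.5, 1.4, 0.2]])
print(req.get_method(), req.full_url, req.data.decode("utf-8"))

# With the server running, uncomment to score one row:
# print(predict("http://localhost:5001/invocations", [[5.1, 3.5, 1.4, 0.2]]))
```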
Add a Test Before Promote
Production promotion should require evaluation:
import mlflow.sklearn
from sklearn.metrics import f1_score

def evaluate_and_promote(model_uri, holdout_X, holdout_y, current_prod_uri):
    # Score the candidate and the current production model on the same holdout set
    new = mlflow.sklearn.load_model(model_uri)
    new_f1 = f1_score(holdout_y, new.predict(holdout_X), average='weighted')
    prod = mlflow.sklearn.load_model(current_prod_uri)
    prod_f1 = f1_score(holdout_y, prod.predict(holdout_X), average='weighted')
    if new_f1 > prod_f1 + 0.01:  # require an absolute F1 gain of more than 0.01
        # promote
        ...
    else:
        print(f"New {new_f1:.3f} not better than prod {prod_f1:.3f}; skipping")

A real pipeline runs this in CI or Workflow Orchestration. No human decides; metrics decide.
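The decision rule itself can be isolated into a pure function, which makes it easy to unit-test in CI without loading any models. The sketch below is one way to write it; `should_promote` is a hypothetical name, and promoting automatically when no production model exists yet is an assumption added here, not something the gate above handles.

```python
def should_promote(new_f1, prod_f1, min_gain=0.01):
    """Decide whether a candidate model should replace production.

    new_f1   -- holdout F1 of the candidate
    prod_f1  -- holdout F1 of the current production model, or None if there is none
    min_gain -- absolute F1 improvement the candidate must exceed
    """
    if prod_f1 is None:
        # Assumption: the first model is promoted with no comparison.
        return True
    return new_f1 > prod_f1 + min_gain


# Placeholder scores, chosen only to exercise each branch.
print(should_promote(0.97, 0.95))   # clear improvement
print(should_promote(0.955, 0.95))  # gain is below the threshold
print(should_promote(0.90, None))   # nothing in production yet
```

Keeping the threshold as a parameter lets the pipeline tighten or relax the gate per model without touching the logic.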
Path B: Serve an LLM with vLLM
vLLM is the high-performance LLM serving engine: continuous batching, paged attention, OpenAI-compatible API.
pip install vllm

GPU required for real models. For learning without GPU, use a small model:

# Smallest practical model; runs on a laptop slowly
python -m vllm.entrypoints.openai.api_server \
  --model microsoft/Phi-3-mini-4k-instruct \
  --max-model-len 4096

With a real GPU (e.g., one H100):

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9

It boots at http://localhost:8000, OpenAI-compatible:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain MLOps in one sentence"}],
    "max_tokens": 100
  }'

Compared with a naive Hugging Face pipeline() loop, throughput under concurrent load is often 10-20× higher in tokens/second, thanks to continuous batching.
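To see why continuous batching helps, consider a toy scheduler simulation. This is a simplified model, not vLLM's scheduler: each request needs a fixed number of decode steps, the server has a fixed number of slots, and every step advances all active requests by one token. The function names and the request lengths are illustrative assumptions.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Fixed batches: each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total


def continuous_batching_steps(lengths, slots):
    """A finished request frees its slot, which the next queued request takes."""
    queue = deque(lengths)
    active = []  # remaining decode steps for each running request
    steps = 0
    while queue or active:
        # Fill any free slots before running the next step.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        # One decode step for every active request; drop the finished ones.
        active = [n - 1 for n in active if n - 1 > 0]
        steps += 1
    return steps


# One long request mixed with several short ones, two slots.
lengths = [10, 2, 2, 2, 2, 2]
static_steps = static_batching_steps(lengths, slots=2)
continuous_steps = continuous_batching_steps(lengths, slots=2)
print(static_steps, continuous_steps)
```

For this input the static schedule takes 14 steps and the continuous one takes 10, because short requests no longer wait for the long one in their batch to finish. Real gains depend on the request mix, the model, and the hardware.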
Path C: A Tiny RAG App
The default LLM application pattern. We'll use a local LLM (or OpenAI) + a vector store + a small document corpus.
pip install langchain langchain-community langchain-openai chromadb sentence-transformers

# rag_app.py
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
# 1. Load docs
loader = TextLoader("my_docs.txt")
docs = loader.load()
# 2. Chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
# 3. Embed and store
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma")
# 4. LLM (uses OPENAI_API_KEY; or point at your vLLM server)
llm = ChatOpenAI(model="gpt-4o-mini")
# Or local: llm = ChatOpenAI(base_url="http://localhost:8000/v1", api_key="dummy", model="...")
# 5. RAG chain
qa = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever(search_kwargs={"k": 3}))
print(qa.invoke({"query": "What is MLOps?"}))

Five components:
- Documents to retrieve from
- Chunking (chunks of 500 chars with 50 overlap)
- Embedding (turn text → vectors)
- Vector store (Chroma here; in production, see Vector Databases)
- LLM that synthesizes
This is the 80% pattern for production LLM features. Refining each step (chunking strategy, embedding model, retrieval reranking, prompt design) is where the real engineering lives.
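To show what the chunking and retrieval steps do under the hood, here is a dependency-free sketch. It makes several simplifying assumptions: `chunk_text` uses fixed character windows (the recursive splitter above prefers to break on separators), `embed` is a word-count stand-in for a neural embedding model, and all function names and sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def embed(text):
    """Stand-in embedding: lowercase word counts instead of a dense vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=3):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query, context_chunks):
    """Assemble the text that would be sent to the LLM."""
    context = "\n---\n".join(context_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


sample_chunks = [
    "MLOps applies DevOps practices to machine learning systems.",
    "Bananas are rich in potassium.",
    "A feature store serves features to models.",
]
top = retrieve("What is MLOps?", sample_chunks, k=1)
print(build_prompt("What is MLOps?", top))
```

Swapping `embed` for a real embedding model and the sorted scan for a vector index yields the same shape as the LangChain pipeline above.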
Observability for the LLM App
You'll quickly want to see what the LLM is doing. Use Langfuse (open source) or LangSmith:
# Add Langfuse (import path from the v2 Python SDK; v3 moved the handler to langfuse.langchain)
from langfuse.callback import CallbackHandler
handler = CallbackHandler(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="https://cloud.langfuse.com",
)
result = qa.invoke({"query": "..."}, config={"callbacks": [handler]})

Now Langfuse shows: every request, the retrieved chunks, the prompt, the LLM call, the response, latency, token count, cost. This is essential for production debugging.
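To illustrate what a tracing tool captures for each step, the sketch below records the input, output, and latency of every call in an in-memory list. It is not the Langfuse API: `traced`, `TRACES`, and the two stub functions are hypothetical names, and a real backend would also record token counts and cost.

```python
import functools
import time

TRACES = []  # in-memory stand-in for an observability backend


def traced(name):
    """Decorator that records input, output, and latency for each call."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            output = fn(*args, **kwargs)
            TRACES.append({
                "name": name,
                "input": {"args": args, "kwargs": kwargs},
                "output": output,
                "latency_ms": (time.perf_counter() - start) * 1000,
            })
            return output
        return inner
    return wrap


@traced("retrieve")
def fake_retrieve(query):
    # Stub retriever: returns one canned chunk.
    return ["chunk about " + query]


@traced("llm")
def fake_llm(prompt):
    # Stub model: echoes the prompt.
    return "stub answer for: " + prompt


answer = fake_llm("context: " + fake_retrieve("MLOps")[0])
for trace in TRACES:
    print(trace["name"], round(trace["latency_ms"], 3), "ms")
```

Each record pairs a step name with what went in and what came out, which is the minimum needed to reconstruct why a given answer was produced.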
Equivalent on SageMaker / Vertex / Databricks
Same loop, vendor-managed:
- SageMaker: Experiments, Model Registry, Pipelines, Endpoints, JumpStart for LLMs
- Vertex AI: Experiments, Model Registry, Pipelines, Endpoints, Generative AI Studio
- Databricks: MLflow built-in, Mosaic AI Serving, Vector Search
The mental model is the same — what changes is who maintains the platform. Managed for speed; assembled OSS for control and cost.
A GPU on a Budget
Real LLMs need GPU. Options for learning without committing:
- Modal, Replicate, RunPod, Lambda Labs: per-second GPU rentals
- Google Colab Pro: A100 / H100 in a notebook, hourly
- Hugging Face Spaces / Inference Endpoints: hosted, pay-per-use
- OpenAI / Anthropic API: no GPU at all; you pay per token
For development of MLOps pipelines and patterns, you mostly don't need GPU — small models or stubs work for the loop. Save the GPU bill for when you're really doing inference work.
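A stub can be as small as the class below. It is a sketch of a deterministic stand-in that returns a response shaped like an OpenAI-style chat completion, so pipeline code can be exercised without a GPU or network. `StubChatModel`, its `create` method, and the canned text are hypothetical, and counting whitespace-separated words as tokens is a deliberate simplification.

```python
class StubChatModel:
    """Deterministic stand-in for a chat model; needs no GPU or network."""

    def __init__(self, canned="MLOps is DevOps discipline applied to machine learning."):
        self.canned = canned
        self.calls = []  # recorded requests, handy for assertions in tests

    def create(self, model, messages, max_tokens=100):
        self.calls.append({"model": model, "messages": messages, "max_tokens": max_tokens})
        # Crude truncation: treat whitespace-separated words as tokens.
        words = self.canned.split()
        truncated = len(words) > max_tokens
        content = " ".join(words[:max_tokens])
        return {
            "object": "chat.completion",
            "model": model,
            "choices": [{
                "index": 0,
                "message": {"role": "assistant", "content": content},
                "finish_reason": "length" if truncated else "stop",
            }],
        }


stub = StubChatModel()
resp = stub.create(
    model="stub-model",
    messages=[{"role": "user", "content": "Explain MLOps in one sentence"}],
    max_tokens=3,
)
print(resp["choices"][0]["message"]["content"])
```

Because the output is fixed, tests for chunking, retrieval, prompt assembly, and tracing can assert on exact strings, and the real model is swapped in only at the end.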
What's Next
- Patterns — training pipelines, feature stores, canary models, drift, RAG patterns
- Best Practices — reproducibility, evaluation, monitoring, GPU cost, safety, pitfalls