Steven's Knowledge

Getting Started

Run MLflow locally, track an experiment, register and serve a model, spin up vLLM for an LLM, build a tiny RAG app

This page walks through the modern MLOps loop in two flavors: a classical model (Path A: MLflow tracking and serving) and an LLM app (Paths B and C: vLLM serving plus a tiny RAG app).

Path A: MLflow Locally

pip install mlflow scikit-learn pandas
mlflow ui   # in another terminal: http://localhost:5000

Train and Track

# train.py
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("iris-classifier")

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for n in [10, 50, 100, 200]:
    with mlflow.start_run():
        mlflow.log_param("n_estimators", n)
        model = RandomForestClassifier(n_estimators=n, random_state=42)
        model.fit(X_train, y_train)
        preds = model.predict(X_test)

        mlflow.log_metric("accuracy", accuracy_score(y_test, preds))
        mlflow.log_metric("f1", f1_score(y_test, preds, average='weighted'))
        mlflow.sklearn.log_model(model, "model", registered_model_name="iris-classifier")

Run it:

python train.py

Open http://localhost:5000. You see:

  • Four runs with different n_estimators
  • Metrics and parameters
  • Artifacts (the saved model)
  • The Models tab shows iris-classifier with versions 1-4

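This is the basic loop: every run is tracked and every model is versioned, so you can answer "which run gave us the best F1, and with which params?" Answering "what was the data?" additionally requires logging a dataset reference (a path, version, or hash) with each run, which train.py does not do yet.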
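One way to make the "what was the data?" question answerable is to log a fingerprint of the training data with each run. The sketch below is an illustration, not part of train.py: `dataset_fingerprint` is a hypothetical helper and the parameter name `train_data_sha256` is an assumption. It uses only the standard library; the commented line shows where `mlflow.log_param` could record the result.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Return a short, deterministic SHA-256 fingerprint of a list of rows.

    Hypothetical helper for illustration. `rows` must be JSON-serializable;
    for a NumPy array, pass `X_train.tolist()`.
    """
    canonical = json.dumps(rows, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Two toy rows standing in for the real training matrix.
train_rows = [[5.1, 3.5, 1.4, 0.2], [6.2, 2.9, 4.3, 1.3]]
fingerprint = dataset_fingerprint(train_rows)

# Inside the `with mlflow.start_run():` block you could then record it:
#     mlflow.log_param("train_data_sha256", fingerprint)
print(fingerprint)
```

Two runs with the same fingerprint saw the same rows in the same order; any change to the data changes the value.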

Promote a Version

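import mlflow

# Point the client at the same tracking server used by train.py
client = mlflow.MlflowClient(tracking_uri="http://localhost:5000")

# Promote v2 to the "Production" stage.
# Note: model stages are deprecated as of MLflow 2.9 in favor of aliases;
# on those versions this call emits a deprecation warning.
client.transition_model_version_stage(
    name="iris-classifier",
    version=2,
    stage="Production",
    archive_existing_versions=True,
)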

Now models:/iris-classifier/Production resolves to v2. The serving configuration doesn't need to change to switch versions: move the Production stage to another version, and the same URI resolves to it the next time the model is loaded.

Serve It

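# Terminal 1: start the scoring server.
# Recent MLflow releases use --env-manager local in place of the older --no-conda flag.
export MLFLOW_TRACKING_URI=http://localhost:5000
mlflow models serve --model-uri models:/iris-classifier/Production --port 5001 --env-manager local

# Terminal 2: score one iris sample
curl -X POST http://localhost:5001/invocations \
  -H "Content-Type: application/json" \
  -d '{"inputs": [[5.1, 3.5, 1.4, 0.2]]}'
# {"predictions": [0]}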

Real production serving has more layers (BentoML, Triton, SageMaker, etc.) but MLflow's built-in serve is enough for a starter.
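The curl request above can also be sent from Python. This is a minimal standard-library sketch, assuming the scoring server from this section is listening on port 5001; `build_invocation_request` and `predict` are hypothetical helper names. Building the request is kept separate from sending it so the payload can be inspected without a running server.

```python
import json
import urllib.request

SCORING_URL = "http://localhost:5001/invocations"  # port used in this section


def build_invocation_request(rows, url=SCORING_URL):
    """Build (but do not send) a POST request carrying {"inputs": rows}."""
    body = json.dumps({"inputs": rows}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(rows, url=SCORING_URL):
    """Send the request and return the "predictions" list (needs the server)."""
    with urllib.request.urlopen(build_invocation_request(rows, url)) as resp:
        return json.loads(resp.read())["predictions"]


# Building the request does not touch the network.
request = build_invocation_request([[5.1, 3.5, 1.4, 0.2]])
print(request.get_method(), request.full_url)
```

With the server running, `predict([[5.1, 3.5, 1.4, 0.2]])` issues the same call as the curl command.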

Add a Test Before Promote

Production promotion should require evaluation:

import mlflow.sklearn
from sklearn.metrics import f1_score

def evaluate_and_promote(model_uri, holdout_X, holdout_y, current_prod_uri):
    # Score the candidate on the holdout set
    new = mlflow.sklearn.load_model(model_uri)
    new_f1 = f1_score(holdout_y, new.predict(holdout_X), average='weighted')

    # Score the current production model on the same holdout set
    prod = mlflow.sklearn.load_model(current_prod_uri)
    prod_f1 = f1_score(holdout_y, prod.predict(holdout_X), average='weighted')

    if new_f1 > prod_f1 + 0.01:   # require at least a 0.01 absolute F1 gain
        # promote
        ...
    else:
        print(f"New {new_f1:.3f} does not beat prod {prod_f1:.3f} by 0.01; skipping")

A real pipeline runs this in CI or Workflow Orchestration. No human decides; metrics decide.
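The promotion rule itself is small enough to isolate and unit-test without MLflow. A minimal sketch, assuming the same 0.01 absolute F1 margin as above; `should_promote` is a hypothetical name, and promoting when there is no production model yet is an assumption rather than something the original function handles.

```python
def should_promote(new_f1, prod_f1, min_gain=0.01):
    """Return True when the candidate beats production by more than min_gain.

    Assumption: with no production model yet (prod_f1 is None), the first
    candidate is promoted.
    """
    if prod_f1 is None:
        return True
    return new_f1 > prod_f1 + min_gain


# A clear win, a gain below the margin, and a regression.
print(should_promote(0.97, 0.95))
print(should_promote(0.955, 0.95))
print(should_promote(0.90, 0.95))
```

Keeping the rule as a pure function lets CI test the gate separately from model loading and scoring.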

Path B: Serve an LLM with vLLM

vLLM is a high-performance LLM serving engine built around continuous batching and paged attention, and it exposes an OpenAI-compatible API.

pip install vllm

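A GPU is required for real models. The default pip package targets NVIDIA GPUs on Linux; vLLM also has a CPU backend, but it generally needs a separate install or a source build. For learning on modest hardware, use a small model:

# Small (3.8B-parameter) model; expect it to be slow without a GPU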
python -m vllm.entrypoints.openai.api_server \
  --model microsoft/Phi-3-mini-4k-instruct \
  --max-model-len 4096

With a real GPU (e.g., one H100):

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9

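The server starts at http://localhost:8000 and exposes an OpenAI-compatible API. The model field in the request must match the --model value the server was launched with: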

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain MLOps in one sentence"}],
    "max_tokens": 100
  }'
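The same request can be assembled in Python, and the reply read from the `choices` list of the response. The sketch below only builds the payload and parses a hand-written sample in the OpenAI chat-completion shape; it does not contact a server, and `build_chat_payload` and `extract_reply` are hypothetical helper names.

```python
import json


def build_chat_payload(model, prompt, max_tokens=100):
    """Build the JSON body for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def extract_reply(response):
    """Pull the assistant text out of a chat-completion response dict."""
    return response["choices"][0]["message"]["content"]


payload = build_chat_payload(
    "meta-llama/Llama-3.1-8B-Instruct", "Explain MLOps in one sentence"
)
body = json.dumps(payload)  # what would be sent as the request body

# Hand-written sample in the OpenAI response shape, NOT real server output.
sample_response = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "(model text here)"}}
    ]
}
reply = extract_reply(sample_response)
print(reply)
```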

Compared with a naive Hugging Face pipeline() loop, expect substantially higher throughput under concurrent load, often on the order of 10-20× more tokens per second, largely thanks to continuous batching.
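To see why continuous batching helps, consider a toy model of the scheduler. This is a simplified simulation, not vLLM's implementation: it assumes each request needs a fixed number of decode steps, every active request produces one token per step, and the batch has a fixed number of slots. The function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes; no slot is refilled."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next queued request at once."""
    queue = deque(lengths)
    active = []  # remaining decode steps for each in-flight request
    steps = 0
    while queue or active:
        # Fill any free slots before running the next decode step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; drop the ones that finished.
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
    return steps


# Short and long requests interleaved, two batch slots.
lengths = [2, 10, 2, 10]
print(static_batching_steps(lengths, batch_size=2))      # 20
print(continuous_batching_steps(lengths, batch_size=2))  # 14
```

In the static case the short requests leave their slot idle while the long one finishes; refilling slots immediately is what recovers that idle capacity. Real gains depend on the workload mix and are not predicted by this toy.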

Path C: A Tiny RAG App

RAG (retrieval-augmented generation) is the default LLM application pattern. We'll use a local LLM (or OpenAI), a vector store, and a small document corpus.

pip install langchain langchain-community langchain-openai chromadb sentence-transformers

# rag_app.py
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

# 1. Load docs
loader = TextLoader("my_docs.txt")
docs = loader.load()

# 2. Chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# 3. Embed and store
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma")

# 4. LLM (uses OPENAI_API_KEY; or point at your vLLM server)
llm = ChatOpenAI(model="gpt-4o-mini")
# Or local: llm = ChatOpenAI(base_url="http://localhost:8000/v1", api_key="dummy", model="...")

# 5. RAG chain
qa = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever(search_kwargs={"k": 3}))

# invoke() returns a dict; the generated answer is under the "result" key
print(qa.invoke({"query": "What is MLOps?"})["result"])

Five components:

  1. Documents to retrieve from
  2. Chunking (chunks of 500 chars with 50 overlap)
  3. Embedding (turn text → vectors)
  4. Vector store (Chroma here; in production, see Vector Databases)
  5. LLM that synthesizes
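The chunking step (2) can be illustrated with a fixed-size sliding window. This is a deliberate simplification: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraphs and sentences, whereas the hypothetical `chunk_text` below cuts purely by character offset. It does show how a 500-character size and 50-character overlap interact.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into windows of chunk_size chars that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap  # how far each window advances
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


sample = "".join(chr(97 + i % 26) for i in range(1000))  # 1000 chars of a..z
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [500, 500, 100]
```

The overlap means the last 50 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still retrievable in one piece.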

This is the 80% pattern for production LLM features. Refining each step (chunking strategy, embedding model, retrieval reranking, prompt design) is where the real engineering lives.
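The embed, store, and retrieve steps reduce to "turn text into vectors, then return the k nearest to the query". The sketch below is a toy stand-in under loud assumptions: word counts replace a real embedding model such as all-MiniLM-L6-v2, and a linear scan replaces Chroma. `embed`, `cosine`, and `top_k` are hypothetical names.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (not a neural model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=3):
    """Return the k chunks most similar to the query (linear scan)."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


corpus = [
    "MLflow tracks experiments and models",
    "vLLM serves large language models",
    "Chroma stores embedding vectors",
]
print(top_k("which tool tracks experiments", corpus, k=1))
```

Swapping `embed` for a real embedding model and the scan for an approximate nearest-neighbor index is what a vector database does at scale.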

Observability for the LLM App

You'll quickly want to see what the LLM is doing. Use Langfuse (open source) or LangSmith:

# Add Langfuse
# Import path for the Langfuse Python SDK v2; v3 exposes the handler from langfuse.langchain
from langfuse.callback import CallbackHandler

handler = CallbackHandler(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="https://cloud.langfuse.com",
)

result = qa.invoke({"query": "..."}, config={"callbacks": [handler]})

Now Langfuse shows: every request, the retrieved chunks, the prompt, the LLM call, the response, latency, token count, cost. This is essential for production debugging.
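To make concrete what a trace captures, here is a standard-library sketch of the idea. It is not Langfuse's API: `TraceRecord`, `traced`, and `fake_qa` are hypothetical names, and a real tracer also records retrieved chunks, token counts, and cost.

```python
import functools
import time
from dataclasses import dataclass


@dataclass
class TraceRecord:
    query: str
    response: str
    latency_s: float


TRACES = []  # in-memory sink; a real tracer ships records to a backend


def traced(fn):
    """Record the query, response, and wall-clock latency of each call."""
    @functools.wraps(fn)
    def wrapper(query):
        start = time.perf_counter()
        response = fn(query)
        TRACES.append(TraceRecord(query, response, time.perf_counter() - start))
        return response
    return wrapper


@traced
def fake_qa(query):
    """Stand-in for the RAG chain so the sketch runs without an LLM."""
    return f"stub answer to: {query}"


fake_qa("What is MLOps?")
print(TRACES[0].query, "->", TRACES[0].response)
```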

Equivalent on SageMaker / Vertex / Databricks

Same loop, vendor-managed:

  • SageMaker: Experiments, Model Registry, Pipelines, Endpoints, JumpStart for LLMs
  • Vertex AI: Experiments, Model Registry, Pipelines, Endpoints, Vertex AI Studio (formerly Generative AI Studio)
  • Databricks: MLflow built-in, Mosaic AI Serving, Vector Search

The mental model is the same — what changes is who maintains the platform. Managed for speed; assembled OSS for control and cost.

A GPU on a Budget

Real LLMs need a GPU. Options for learning without committing to hardware:

  • Modal, Replicate, RunPod, Lambda Labs: per-second GPU rentals
  • Google Colab Pro: A100 / H100 in a notebook, hourly
  • Hugging Face Spaces / Inference Endpoints: hosted, pay-per-use
  • OpenAI / Anthropic API: no GPU at all; you pay per token

For developing MLOps pipelines and patterns, you mostly don't need a GPU: small models or stubs are enough to exercise the loop. Save the GPU bill for when you're doing real inference work.

What's Next

  • Patterns — training pipelines, feature stores, canary models, drift, RAG patterns
  • Best Practices — reproducibility, evaluation, monitoring, GPU cost, safety, pitfalls
