Steven's Knowledge

Embeddings & Vector Spaces

The lingua franca of modern AI — how meaning becomes geometry

Embeddings turn discrete things — words, sentences, images, users, products — into vectors in a continuous space, where semantic similarity becomes geometric proximity. They are the connective tissue between models, search systems, recommendation engines, and retrieval pipelines.

What Makes a Good Embedding

A good embedding space has a simple property: things that are similar in the way you care about end up close together. "King" and "queen" are close. A photo of the Eiffel Tower and the word "Paris" can be close in a multimodal space. Two paraphrases of the same question land in roughly the same neighborhood.

How They're Trained

Most modern embeddings come from neural networks trained with contrastive objectives: pull positive pairs together, push negative pairs apart. The training data is what defines what "similar" means — code search embeddings know that two functions doing the same thing are similar; instruction-following embeddings know that a question and its answer are related.
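The "pull together, push apart" idea can be sketched with an InfoNCE-style loss, which is one common contrastive objective. The text above does not name a specific loss, so the choice of InfoNCE, the `temperature` value, and the toy 2-D vectors are all assumptions made for illustration. Each anchor is paired with the positive at the same index, and the other positives in the batch serve as negatives.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    positives[i] is the positive for anchors[i]; every other row of
    `positives` is treated as an in-batch negative (an assumption of
    this sketch, not something the source specifies).
    """
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Cosine similarity to every candidate, sharpened by the temperature.
        logits = [dot(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Cross-entropy with the matching index as the correct class.
        total += log_denominator - logits[i]
    return total / len(anchors)

# Toy batch: the loss is low when each anchor sits next to its own positive,
# and high when the positives are swapped.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce_loss(anchors, aligned)
loss_swapped = info_nce_loss(anchors, swapped)
print(loss_aligned < loss_swapped)
```

In real training the vectors come from an encoder and the gradient of this loss updates its weights; here the vectors are fixed so that only the shape of the objective is visible.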

Once you have embeddings, retrieval becomes nearest-neighbor search:

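  • Exact search — compare the query against every stored vector. Works fine for small collections, on the order of ~10K vectors; the practical ceiling depends on dimension, hardware, and latency budget.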
  • Approximate nearest neighbor (ANN) — HNSW, IVF, scalar/product quantization. Standard above that.
  • Vector databases — Pinecone, Weaviate, Qdrant, pgvector. All wrap ANN indexes with operational concerns: filtering, updates, replication.
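The exact-search option in the list above amounts to scoring every stored vector against the query and keeping the best few. The following is a minimal sketch using only the standard library; the `exact_search` function, the document ids, and the 3-D vectors are hypothetical and chosen purely for illustration.

```python
import heapq
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_search(query, corpus, k=2):
    """Return the k most similar (doc_id, score) pairs.

    Scans the whole corpus, so the cost per query grows linearly with
    the number of stored vectors.
    """
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in corpus.items())
    top = heapq.nlargest(k, scored)
    return [(doc_id, score) for score, doc_id in top]

# Hypothetical toy corpus.
corpus = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.8, 0.6, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}

results = exact_search([1.0, 0.1, 0.0], corpus, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

ANN indexes such as HNSW or IVF avoid the full scan by visiting only a subset of candidates, trading a small amount of recall for much lower latency on large collections.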

Distance Metrics

  • Cosine similarity — measures angle, not magnitude. The default for normalized embeddings.
  • Dot product — same as cosine if vectors are unit-normalized.
  • Euclidean distance — sometimes used; on unit-normalized vectors it produces the same ranking as cosine similarity, because the squared distance equals 2 minus twice the cosine.
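The relationship between the three metrics can be checked numerically. This is a small sketch with made-up 2-D vectors: after unit normalization the dot product equals the cosine similarity, and the squared Euclidean distance equals 2 - 2 * cosine, so all three order neighbors the same way.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Arbitrary example vectors, normalized to unit length.
a = normalize([3.0, 4.0])
b = normalize([1.0, 2.0])

cos_ab = cosine(a, b)
dot_ab = dot(a, b)
sq_dist_ab = squared_euclidean(a, b)

# For unit vectors: dot == cosine, and ||a - b||^2 == 2 - 2 * cosine.
print(math.isclose(dot_ab, cos_ab))
print(math.isclose(sq_dist_ab, 2 - 2 * cos_ab))
```

Without normalization the identities no longer hold: the dot product then also rewards vector magnitude, which matters for models whose embeddings are not unit length.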

Common Pitfalls

  • Mixing models — embeddings from different models live in different spaces. Don't compare them.
  • Stale indexes — when the embedding model changes, vectors from the old and new versions are not comparable, so every document has to be re-embedded and the index rebuilt.
  • Dimension cost — higher-dimensional embeddings cost more to store and search; pick the smallest dimension that meets your quality bar.
  • Domain drift — a general-purpose embedding model may underperform on specialized vocabulary (legal, medical, code).
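One way to guard against the first two pitfalls is to record which model produced an index and reject anything that does not match. The sketch below is an illustration under assumptions: the `TaggedIndex` class, the `model_id` strings, and the dimension check are hypothetical and do not correspond to any particular vector database's API.

```python
from dataclasses import dataclass, field

@dataclass
class TaggedIndex:
    """In-memory store that remembers which embedding model filled it."""
    model_id: str
    dim: int
    vectors: dict = field(default_factory=dict)

    def _check(self, vector, model_id):
        # Vectors from another model or model version live in a different space.
        if model_id != self.model_id:
            raise ValueError(
                f"index built with {self.model_id!r}, got vector from {model_id!r}"
            )
        # A dimension mismatch is a cheap secondary signal of the same mistake.
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")

    def add(self, doc_id, vector, model_id):
        self._check(vector, model_id)
        self.vectors[doc_id] = list(vector)

# Hypothetical model names used only for this example.
index = TaggedIndex(model_id="encoder-v1", dim=3)
index.add("doc_a", [0.1, 0.2, 0.3], model_id="encoder-v1")

try:
    index.add("doc_b", [0.1, 0.2, 0.3], model_id="encoder-v2")
    rejected = False
except ValueError:
    rejected = True

print(rejected)
```

The same check applies at query time: a query embedded with a newer model should fail loudly rather than return neighbors that look plausible but are meaningless.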

Why It Matters

Almost every "AI feature" with a memory or a corpus runs on embeddings underneath: RAG, semantic search, dedup, recommendations, clustering, classification. Get the embedding layer right and the rest of the stack gets easier.
