Steven's Knowledge

Best Practices

Production vector databases - embedding models, dimensions, indexes, metadata, cost, observability

Vector databases are deceptively simple — store vectors, get nearest neighbors. The hard parts are choosing the right embedding model, sizing indexes, managing cost, and keeping the system fresh as you re-embed.

Choosing an Embedding Model

The embedding model is the single most impactful choice. Criteria:

| Concern | What matters |
| --- | --- |
| Quality | Benchmark on your data, not generic benchmarks |
| Dimensionality | Storage and compute scale with dims; 768-1536 is the sweet spot |
| Cost | Per-token pricing; you'll embed millions of tokens |
| Latency | Self-hosted GPU vs SaaS API |
| Multilinguality | Cross-language? Pick a multilingual model |
| Domain fit | Code? Legal docs? Some models are domain-specialized |
| Stability | Vendor deprecating models = re-embed everything |

Models in 2026

| Model | Dims | Cost | Notes |
| --- | --- | --- | --- |
| OpenAI text-embedding-3-small | 1536 | $0.02/M tokens | Default; cheap; good |
| OpenAI text-embedding-3-large | 3072 (or shorter) | $0.13/M tokens | Higher quality; pricier |
| Voyage voyage-3 | 1024 | $0.06/M tokens | Quality-leading for retrieval |
| Cohere embed-v3 | 1024 | $0.10/M tokens | Multilingual; reranker pair |
| BAAI bge-large-en-v1.5 | 1024 | Self-host | Open-source; strong baseline |
| nomic-embed-text-v1.5 | 64-768 (Matryoshka) | Self-host | Configurable dims |

Run a small eval: pick 100 representative queries with known relevant docs, compare top-K recall across 2-3 models. Quality differences are real and surprising.
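The eval described above can be sketched in a few lines. This is a minimal sketch, not a full harness: the `EvalCase` shape and the `search` callback are hypothetical names, and `search` is synchronous here for brevity where a real vector DB client would be async.

```typescript
// One labeled example: a query plus the IDs of documents known to be relevant.
interface EvalCase {
  query: string;
  relevantIds: string[];
}

// Fraction of the known-relevant docs that appear in the top-K results.
function recallAtK(retrievedIds: string[], relevantIds: string[], k: number): number {
  if (relevantIds.length === 0) return 0;
  const topK = new Set(retrievedIds.slice(0, k));
  const hits = relevantIds.filter((id) => topK.has(id)).length;
  return hits / relevantIds.length;
}

// Mean recall@K over the eval set for one model's search function.
// `search` is a hypothetical callback returning ranked document IDs.
function meanRecallAtK(
  cases: EvalCase[],
  search: (query: string) => string[],
  k: number,
): number {
  if (cases.length === 0) return 0;
  const total = cases.reduce(
    (sum, c) => sum + recallAtK(search(c.query), c.relevantIds, k),
    0,
  );
  return total / cases.length;
}
```

Run `meanRecallAtK` once per candidate model over the same cases and compare the numbers side by side.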

Matryoshka Embeddings

Newer models (OpenAI v3, Nomic) produce Matryoshka embeddings — you can truncate to fewer dimensions with minimal quality loss:

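import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const text = 'Example passage to embed';

// OpenAI: request a 256-dim version of the 1536-dim embedding
const emb = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: text,
  dimensions: 256,    // shortened embedding returned by the API
});
const vector = emb.data[0].embedding; // number[] with 256 entries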

256-dim is 6× cheaper to store and faster to query than 1536-dim with minor quality loss. Worth testing if cost matters.
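If you already store full-length vectors, or self-host a Matryoshka model, you can also shorten on the client. A minimal sketch, assuming the model supports prefix truncation: keep the first N dimensions, then re-normalize so cosine and dot-product scores stay comparable. The function names are illustrative, and some models document extra steps before truncation, so check the model card.

```typescript
// Scale a vector to unit length (L2 norm of 1). Zero vectors are returned unchanged.
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

// Keep the first `dims` components of a Matryoshka embedding and re-normalize.
function truncateEmbedding(full: number[], dims: number): number[] {
  if (dims <= 0 || dims > full.length) {
    throw new RangeError(`dims must be between 1 and ${full.length}`);
  }
  return l2Normalize(full.slice(0, dims));
}
```

Truncated and full-length vectors are not comparable with each other, so pick one length per collection.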

Dimensionality Trade-offs

| Dims | Storage / vector | Query speed | Quality (typical) |
| --- | --- | --- | --- |
| 256 | 1 KB | Fastest | Good baseline (with Matryoshka models) |
| 768 | 3 KB | Fast | Strong |
| 1536 | 6 KB | Medium | Strong |
| 3072 | 12 KB | Slower | Slightly better |

Storage matters at scale. 100M vectors × 1536 dims × 4 bytes ≈ 600 GB before indexes. Plan accordingly.
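The arithmetic above is easy to script for your own numbers. A small sketch, assuming float32 storage (4 bytes per dimension) and decimal gigabytes; the function name is illustrative.

```typescript
const BYTES_PER_GB = 1e9; // decimal GB, matching the estimate in the text

// Raw vector storage before any index overhead.
// bytesPerDim defaults to 4 (float32); pass 2 for float16 or 1 for int8.
function rawVectorGb(vectorCount: number, dims: number, bytesPerDim = 4): number {
  return (vectorCount * dims * bytesPerDim) / BYTES_PER_GB;
}

const at1536 = rawVectorGb(100e6, 1536); // about 614 GB
const at256 = rawVectorGb(100e6, 256);   // about 102 GB
```

The same 100M vectors at 256 dims come to roughly a sixth of the 1536-dim footprint.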

Index Types

| Index | Speed | Recall | Build time | Memory |
| --- | --- | --- | --- | --- |
| HNSW (Hierarchical Navigable Small World) | Fast | High | Slow | High |
| IVF (Inverted File) | Fast (after training) | Medium | Fast | Low |
| IVF + PQ (Product Quantization) | Fast | Lower | Medium | Very low |
| DiskANN / Vamana | Fast | High | Slow | Very low (disk-resident) |
| Flat (brute force) | Slow | Perfect | None | High |

HNSW is the default for most modern vector DBs — fastest queries, good recall. Build time and memory are the cost.

For >100M vectors, you'll need disk-based options (DiskANN, IVF+PQ on disk) — pure-in-memory HNSW gets expensive.

Tuning knobs:

  • M (HNSW) — graph connectivity; higher = better recall, more memory. Default 16-32.
  • ef_construction — quality of build; higher = slower build, better recall. Default 100-200.
  • ef_search — quality of search; higher = slower query, better recall. Tune per query.

Most teams accept the defaults; revisit when you have measurable quality / latency / cost issues.
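When you do revisit ef_search, one reasonable approach is to sweep a few values against your eval set and keep the cheapest setting that still meets a recall target. A sketch under that assumption; the `SweepPoint` shape is hypothetical and the measurements must come from your own benchmark.

```typescript
// One benchmark measurement for a given ef_search value.
interface SweepPoint {
  efSearch: number;
  recall: number; // recall@K on the eval set, 0..1
  p99Ms: number;  // p99 query latency in milliseconds
}

// Lowest-latency setting that meets the recall target, or undefined if none does.
function pickEfSearch(points: SweepPoint[], minRecall: number): SweepPoint | undefined {
  return points
    .filter((p) => p.recall >= minRecall)
    .sort((a, b) => a.p99Ms - b.p99Ms)[0];
}
```

If nothing meets the target, that is the signal to raise M or ef_construction and rebuild rather than push ef_search further.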

Filter Performance

Pre-filtering (apply filter, then ANN) vs post-filtering (ANN, then filter):

| Filter selectivity | Pre-filter | Post-filter |
| --- | --- | --- |
| Sparse filter (1% pass) | Very slow; degenerates to brute force | Fast but might return too few |
| Dense filter (50% pass) | Fast | Fast |
| Most modern DBs | Hybrid: filter index intersected with ANN | Best of both |

For very selective filters (per-user data), some DBs (Qdrant, pgvector with partial indexes) handle this well; others struggle. Test with realistic filter selectivity.
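The "too few results" failure of post-filtering is easy to reproduce in memory. The sketch below uses exact dot-product ranking as a stand-in for ANN, so it shows result-count behavior only, not performance; the `Doc` shape and function names are illustrative.

```typescript
interface Doc {
  id: string;
  tenant: string;
  vector: number[];
}

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Exact top-K by dot product (stand-in for an ANN index).
function topK(docs: Doc[], query: number[], k: number): Doc[] {
  return [...docs]
    .sort((a, b) => dot(b.vector, query) - dot(a.vector, query))
    .slice(0, k);
}

// Pre-filter: narrow the candidates first, then rank. Returns k results if k matches exist.
function preFilterSearch(docs: Doc[], query: number[], k: number, keep: (d: Doc) => boolean): Doc[] {
  return topK(docs.filter(keep), query, k);
}

// Post-filter: rank first, then drop non-matching results. Can return fewer than k.
// overFetch > 1 pulls extra candidates to compensate for selective filters.
function postFilterSearch(
  docs: Doc[],
  query: number[],
  k: number,
  keep: (d: Doc) => boolean,
  overFetch = 1,
): Doc[] {
  return topK(docs, query, k * overFetch).filter(keep).slice(0, k);
}
```

Over-fetching helps, but the multiplier you need grows as the filter gets more selective, which is why testing with realistic selectivity matters.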

Multi-Tenancy

Same patterns as Search:

| Approach | When |
| --- | --- |
| One collection per tenant | Few tenants, big data each; clean isolation |
| Shared collection + tenant_id filter | Many tenants; smaller data each |
| One DB per tenant | High security; expensive |

For SaaS with thousands of tenants: shared collection, tenant_id filter, index that field. For very large tenants (enterprise customers with millions of vectors each): one collection per tenant.
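For the shared-collection approach, the tenant filter should come from the authenticated session, not from the request body. A minimal sketch, assuming a flat key/value filter; the `Session` and `Filter` types are hypothetical, and your vector DB will have its own filter syntax.

```typescript
// Hypothetical flat metadata filter: field name -> required value.
type Filter = Record<string, string | number | boolean>;

// Hypothetical authenticated session resolved server-side.
interface Session {
  tenantId: string;
}

// Merge the client's filter with the tenant constraint.
// The client filter is spread first so the server-side tenant_id always wins.
function buildTenantFilter(session: Session, clientFilter: Filter = {}): Filter {
  return { ...clientFilter, tenant_id: session.tenantId };
}
```

Even if a client sends its own `tenant_id`, the value from the session overwrites it.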

Re-Embedding

You'll need to re-embed when:

  • Switching embedding models (vendor deprecation, quality upgrade).
  • Adding new fields to the embedded text.
  • The embedding model improves and you want the gains.

Plan for this from day one: store the embedding's model name + version with each vector. When re-embedding, do it in batches alongside the live index (blue-green pattern):

1. Create new collection `products_v2`
2. Backfill: for each row, generate new embedding, upsert to v2
3. Validate quality on the new collection
4. Cut over reads (config flag)
5. Drop old collection
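Step 2 of the steps above can be sketched as a batched loop. The `Embedder` and `Collection` interfaces are hypothetical stand-ins for your embedding client and vector DB client, not a real library API; retries and rate limiting are left out.

```typescript
interface SourceRow { id: string; text: string; }

interface VectorRecord {
  id: string;
  vector: number[];
  model: string;        // stored so a later re-embed can find stale vectors
  modelVersion: string;
}

// Hypothetical client interfaces; swap in your real SDK wrappers.
interface Embedder {
  model: string;
  version: string;
  embed(texts: string[]): Promise<number[][]>;
}
interface Collection {
  upsert(records: VectorRecord[]): Promise<void>;
}

// Split items into consecutive batches of at most `size`.
function batches<T>(items: T[], size: number): T[][] {
  if (size <= 0) throw new RangeError("size must be positive");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Embed every row with the new model and upsert into the new collection (e.g. products_v2).
async function backfill(
  rows: SourceRow[],
  embedder: Embedder,
  target: Collection,
  batchSize = 100,
): Promise<number> {
  let written = 0;
  for (const batch of batches(rows, batchSize)) {
    const vectors = await embedder.embed(batch.map((r) => r.text));
    await target.upsert(
      batch.map((r, i) => ({
        id: r.id,
        vector: vectors[i],
        model: embedder.model,
        modelVersion: embedder.version,
      })),
    );
    written += batch.length;
  }
  return written;
}
```

Because upserts are keyed by `id`, the backfill can be re-run after a failure without creating duplicates.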

For embedding generation at scale, use the provider's batch API (50% cheaper at OpenAI, similar with others) — it's not real-time but huge for backfills.
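Batch APIs usually take a file of requests rather than live calls. The sketch below builds such a file, assuming the JSONL request shape OpenAI documents for its Batch API (`custom_id`, `method`, `url`, `body`); verify the field names against the current provider docs before relying on it. The function name is illustrative.

```typescript
// One request line, following the assumed OpenAI Batch API input format.
interface BatchLine {
  custom_id: string; // echoed back in the results, used to match vectors to rows
  method: "POST";
  url: "/v1/embeddings";
  body: { model: string; input: string };
}

// Build the JSONL payload: one JSON object per line, one line per row.
function toBatchJsonl(rows: { id: string; text: string }[], model: string): string {
  return rows
    .map((r) => {
      const line: BatchLine = {
        custom_id: r.id,
        method: "POST",
        url: "/v1/embeddings",
        body: { model, input: r.text },
      };
      return JSON.stringify(line);
    })
    .join("\n");
}
```

Using the row's primary key as `custom_id` keeps the result file easy to join back to the source table.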

Cost Management

Four cost axes:

| Axis | Optimization |
| --- | --- |
| Embedding generation | Batch API (50% off); shorter chunks; smaller models |
| Storage | Lower dims (Matryoshka); compress with PQ; archive cold data |
| Query | Cache common queries; reduce ef_search; pre-filter aggressively |
| Self-host vs SaaS | At >10M vectors, often cheaper to self-host |

Real example: 10M docs × 4 chunks each = 40M embeddings. At 1536 dims × 4 bytes each, that is about 246 GB of raw vectors. With HNSW index overhead (~1.5×), the working set is roughly 370-400 GB.

  • Pinecone Serverless: ~$500-1500/month at this size.
  • Qdrant self-host on a 256 GB RAM machine: ~$200-400/month compute + ops time.
  • pgvector on managed Postgres with a large instance: similar.

Costs scale roughly linearly with vector count × dimensions. Pick the model dim before you embed everything.
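The one-time embedding bill for the same corpus can be estimated the same way. A sketch, assuming 500 tokens per chunk (an assumption, at the low end of the chunk size recommended on this page) and the text-embedding-3-small price from the model table; the function name is illustrative.

```typescript
// One-time embedding generation cost in USD.
// batchDiscount is a fraction, e.g. 0.5 for a 50%-off batch API.
function embeddingCostUsd(
  chunks: number,
  tokensPerChunk: number,
  usdPerMillionTokens: number,
  batchDiscount = 0,
): number {
  const tokens = chunks * tokensPerChunk;
  return (tokens / 1e6) * usdPerMillionTokens * (1 - batchDiscount);
}

// 40M chunks × 500 tokens = 20B tokens at $0.02 per million tokens.
const realtime = embeddingCostUsd(40e6, 500, 0.02);      // about $400
const batched = embeddingCostUsd(40e6, 500, 0.02, 0.5);  // about $200
```

At these prices, generation is a small one-time cost next to monthly storage and query spend, but it recurs in full every time you re-embed.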

Observability

| Signal | Why |
| --- | --- |
| Query latency p99 | Especially with filter selectivity changes |
| Recall on a held-out eval set | Are results getting worse? |
| Embedding API latency | Often the bottleneck of the whole RAG loop |
| Embedding API errors / rate limits | Plan retry / fallback |
| Index build time | When you add many docs at once |
| Disk / memory usage | Vector DBs are storage-bound; alert before full |
| Top-K result similarity scores | Distribution drift means model or content changed |

For RAG specifically, build an evaluation harness — a set of queries with expected answers. Run weekly. Score with RAGAS, LangSmith, or a custom rubric. Regressions are silent without it.
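Two of the signals above need only a few lines to compute from logged samples. A minimal sketch: a nearest-rank percentile for latency, and a crude drift check that compares mean top-K similarity against a baseline window. The function names and the tolerance are assumptions to tune for your data.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function mean(values: number[]): number {
  if (values.length === 0) throw new Error("no samples");
  return values.reduce((sum, x) => sum + x, 0) / values.length;
}

// Flags drift when mean similarity moves more than `tolerance` from the baseline.
function scoreDrifted(baseline: number[], current: number[], tolerance: number): boolean {
  return Math.abs(mean(current) - mean(baseline)) > tolerance;
}
```

A mean shift is a blunt instrument; it catches a swapped model or a flood of off-topic content, not subtle ranking regressions, which is what the eval set is for.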

Stale Embeddings

If you embed user-generated content, you have a lifecycle issue: edits, deletions, re-classifications.

  • On edit: re-embed and upsert.
  • On delete: delete from the index (and your DB).
  • On schema change (new fields in embedded text): re-embed.

The flow:

App write to DB ──► CDC / outbox ──► embedding worker ──► upsert to vector DB

Same pattern as keeping any derived index in sync with a source-of-truth DB. See Background Jobs for the queue layer.
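The embedding worker's decision logic can be sketched as a pure function: delete on delete, skip when the text has not changed, otherwise re-embed and upsert. The event and action shapes are hypothetical, and the hash below is a simple dependency-free stand-in; a production worker would more likely store a SHA-256 of the embedded text.

```typescript
// Hypothetical change events arriving from CDC / the outbox.
type ChangeEvent =
  | { kind: "upsert"; id: string; text: string }
  | { kind: "delete"; id: string };

// What the worker should do for one event.
type Action =
  | { op: "embed_and_upsert"; id: string; text: string; contentHash: string }
  | { op: "delete"; id: string }
  | { op: "skip"; id: string };

// djb2-style 32-bit string hash (illustrative stand-in for a cryptographic hash).
function contentHashOf(text: string): string {
  let h = 5381;
  for (let i = 0; i < text.length; i++) {
    h = ((h * 33) ^ text.charCodeAt(i)) >>> 0;
  }
  return h.toString(16);
}

// storedHash is the hash saved alongside the existing vector, if there is one.
function planAction(event: ChangeEvent, storedHash?: string): Action {
  if (event.kind === "delete") return { op: "delete", id: event.id };
  const contentHash = contentHashOf(event.text);
  if (contentHash === storedHash) return { op: "skip", id: event.id };
  return { op: "embed_and_upsert", id: event.id, text: event.text, contentHash };
}
```

Skipping unchanged text matters when edits touch fields that are not part of the embedded text, since every skipped event is an embedding call you do not pay for.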

Security

  • API keys for embedding models in secret manager — see Secrets.
  • Per-tenant filters at the query layer — never trust client-supplied tenant_id.
  • Embeddings can leak content — though hard to reverse, they encode source information. Treat them as sensitive data.
  • Don't embed PII you don't need in the search context. Embed identifiers; resolve to PII per-query in your app.

Common Pitfalls

| Pitfall | Symptom | Fix |
| --- | --- | --- |
| Different embedding models for store vs query | All scores near zero | Same model both sides |
| Forgetting to normalize | Dot product scores look wrong | Use cosine metric, or pre-normalize |
| Embedding raw HTML / markdown | Tokens wasted on <div> etc | Strip / clean text before embedding |
| One vector per giant document | Poor retrieval | Chunk; ~500-1000 tokens per chunk |
| Filter on un-indexed field | Slow queries | Index filter fields |
| Vector-only when keyword would win | Bad results for exact-match queries | Hybrid search |
| No eval set | Silent quality regressions | Maintain a labeled eval set |
| Embedding everything ad-hoc | Costs balloon, no consistency | Centralize embedding logic in one service |
| Letting LLM make up citations | Hallucinated answers | Validate citations exist in retrieved docs |
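The chunking fix in the table can be sketched as a sliding window with overlap. This is a rough sketch: "tokens" here are whitespace-separated words, an approximation of model tokens, and it ignores semantic boundaries such as headings and paragraphs. The function name is illustrative.

```typescript
// Split text into chunks of up to `size` words, with `overlap` words shared
// between neighboring chunks so context is not cut mid-thought.
function chunkText(text: string, size: number, overlap: number): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("need size > 0 and 0 <= overlap < size");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += size - overlap) {
    chunks.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break; // this chunk reached the end of the text
  }
  return chunks;
}
```

For real token budgets, replace the word split with the tokenizer that matches your embedding model.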

Checklist

Production vector database checklist

  • Embedding model chosen via eval, not just defaults
  • Same model used for store and query
  • Model name + version stored with vectors (for future re-embedding)
  • Chunking strategy appropriate to content (size, overlap, semantic boundaries)
  • Cosine distance, or dot product on pre-normalized vectors
  • Index type chosen (HNSW for most; DiskANN for hyperscale)
  • Metadata filters use indexed fields
  • Hybrid search (vector + keyword) for retrieval, not pure vector
  • Reranking for high-quality results
  • Per-tenant filters enforced server-side
  • Eval set + regression testing for RAG quality
  • Embedding API failures handled with retries / fallback
  • Re-embedding plan for model upgrades (blue-green)
  • Storage and query latency monitored
  • Cost monitoring (embedding tokens, vector storage)
  • Embeddings treated as sensitive data
  • Citations validated; LLM not trusted to make them up
