Best Practices

Vector databases are deceptively simple — store vectors, get nearest neighbors. The hard parts are choosing the right embedding model, sizing indexes, managing cost, and keeping the system fresh as you re-embed.

Choosing an Embedding Model

This is the single most impactful choice. Criteria:

Concern	What matters
Quality	Benchmark on your data, not generic benchmarks
Dimensionality	Storage and compute scale with dims; 768-1536 is the sweet spot
Cost	Per-token pricing; you'll embed millions of tokens
Latency	Self-hosted GPU vs SaaS API
Multilinguality	Cross-language? Pick a multilingual model
Domain fit	Code? Legal docs? Some models are domain-specialized
Stability	Vendor deprecating models = re-embed everything

Models in 2026

Model	Dims	Cost	Notes
OpenAI `text-embedding-3-small`	1536	$0.02/M tokens	Default; cheap; good
OpenAI `text-embedding-3-large`	3072 (or shorter)	$0.13/M tokens	Higher quality; pricier
Voyage `voyage-3`	1024	$0.06/M tokens	Quality-leading for retrieval
Cohere `embed-v3`	1024	$0.10/M tokens	Multilingual; reranker pair
BAAI `bge-large-en-v1.5`	1024	Self-host	Open-source; strong baseline
`nomic-embed-text-v1.5`	64-768 (Matryoshka)	Self-host	Configurable dims

Run a small eval: pick 100 representative queries with known relevant docs, then compare top-K recall across 2-3 models. Quality differences are real and often surprising.
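
The eval described above can be sketched as follows. This is a minimal sketch, not a full harness: `EvalQuery`, `SearchFn`, `recallAtK`, and `meanRecallAtK` are hypothetical names, and the search function is assumed to wrap whichever model and vector DB you are testing.

```typescript
// Hypothetical eval-set shape: each query lists the doc IDs judged relevant.
interface EvalQuery {
  query: string;
  relevantIds: string[];
}

// Hypothetical search function: returns ranked doc IDs for a query.
type SearchFn = (query: string, k: number) => Promise<string[]>;

// Recall@K for one query: fraction of relevant docs that appear in the top K.
function recallAtK(retrieved: string[], relevant: string[], k: number): number {
  if (relevant.length === 0) return 0;
  const topK = new Set(retrieved.slice(0, k));
  const hits = relevant.filter((id) => topK.has(id)).length;
  return hits / relevant.length;
}

// Mean recall@K over the eval set for one candidate model's search function.
async function meanRecallAtK(
  evalSet: EvalQuery[],
  search: SearchFn,
  k: number,
): Promise<number> {
  if (evalSet.length === 0) return 0;
  let total = 0;
  for (const q of evalSet) {
    const retrieved = await search(q.query, k);
    total += recallAtK(retrieved, q.relevantIds, k);
  }
  return total / evalSet.length;
}
```

Run `meanRecallAtK` once per candidate model against the same eval set and the same K, then compare the numbers side by side.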

Matryoshka Embeddings

Newer models (OpenAI v3, Nomic) produce Matryoshka embeddings — you can truncate to fewer dimensions with minimal quality loss:

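// OpenAI: get a 256-dim version of a 1536-dim embedding
import OpenAI from 'openai';
const openai = new OpenAI();   // reads OPENAI_API_KEY from the environment
const emb = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: text,
  dimensions: 256,    // return the embedding shortened to 256 dims
});
const vector = emb.data[0].embedding;   // number[] of length 256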

256-dim is 6× cheaper to store and faster to query than 1536-dim with minor quality loss. Worth testing if cost matters.
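
When the embedding call does not offer a `dimensions` parameter, truncation can be done client-side. The sketch below assumes the model supports plain prefix truncation (check the model card, since some models need an extra normalization step before slicing). `truncateAndNormalize` is a hypothetical helper name.

```typescript
// Keep the first `dims` components, then rescale to unit length so that
// dot product and cosine similarity agree again after truncation.
function truncateAndNormalize(embedding: number[], dims: number): number[] {
  if (dims <= 0 || dims > embedding.length) {
    throw new RangeError(`dims must be between 1 and ${embedding.length}`);
  }
  const head = embedding.slice(0, dims);
  const norm = Math.sqrt(head.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) return head; // all-zero vector: nothing to rescale
  return head.map((x) => x / norm);
}
```

Truncate stored vectors and query vectors to the same length; mixing lengths in one index will not work.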

Dimensionality Trade-offs

Dims	Storage / vector	Query speed	Quality (typical)
256	1 KB	Fastest	Good baseline (with Matryoshka models)
768	3 KB	Fast	Strong
1536	6 KB	Medium	Strong
3072	12 KB	Slower	Slightly better

Storage matters at scale. 100M vectors × 1536 dims × 4 bytes ≈ 600 GB before indexes. Plan accordingly.
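
The arithmetic above is easy to script for your own numbers. A minimal sketch, assuming float32 vectors (4 bytes per dimension) and decimal gigabytes; `rawVectorBytes` is a hypothetical helper and index overhead is deliberately excluded.

```typescript
// Raw storage for a set of vectors, in bytes (index overhead excluded).
function rawVectorBytes(vectorCount: number, dims: number, bytesPerDim = 4): number {
  return vectorCount * dims * bytesPerDim;
}

const GB = 1e9; // decimal gigabytes, matching the figures in the text

// 100M vectors at two candidate dimensionalities.
const gbAt1536 = rawVectorBytes(100_000_000, 1536) / GB; // 614.4 GB
const gbAt256 = rawVectorBytes(100_000_000, 256) / GB;   // 102.4 GB
```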

Index Types

Index	Speed	Recall	Build time	Memory
HNSW (Hierarchical Navigable Small World)	Fast	High	Slow	High
IVF (Inverted File)	Fast (after training)	Medium	Fast	Low
IVF + PQ (Product Quantization)	Fast	Lower	Medium	Very low
DiskANN / Vamana	Fast	High	Slow	Very low (disk-resident)
Flat (brute force)	Slow	Perfect	None	High

HNSW is the default for most modern vector DBs — fastest queries, good recall. Build time and memory are the cost.

For >100M vectors, you'll need disk-based options (DiskANN, IVF+PQ on disk) — pure-in-memory HNSW gets expensive.

Tuning knobs:

M (HNSW) — graph connectivity; higher = better recall, more memory. Default 16-32.
ef_construction — quality of build; higher = slower build, better recall. Default 100-200.
ef_search — quality of search; higher = slower query, better recall. Tune per query.

Most teams accept the defaults; revisit when you have measurable quality / latency / cost issues.
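
When you do revisit `ef_search`, one approach is to sweep a few values against your eval set and keep the cheapest one that meets your targets. The sketch below assumes you have already measured recall and p99 latency per setting; `EfMeasurement` and `pickEfSearch` are hypothetical names.

```typescript
interface EfMeasurement {
  efSearch: number;
  recall: number;       // measured on your eval set, 0..1
  p99LatencyMs: number; // measured under realistic load
}

// Pick the smallest ef_search that meets the recall target within the
// latency budget, or undefined if no measured setting qualifies.
function pickEfSearch(
  measurements: EfMeasurement[],
  minRecall: number,
  maxP99Ms: number,
): EfMeasurement | undefined {
  return [...measurements]
    .sort((a, b) => a.efSearch - b.efSearch)
    .find((m) => m.recall >= minRecall && m.p99LatencyMs <= maxP99Ms);
}
```

An `undefined` result means the targets cannot be met by `ef_search` alone, which is the signal to look at `M`, `ef_construction`, or the index type.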

Filter Performance

Pre-filtering applies the filter first and then runs ANN over the survivors; post-filtering runs ANN first and then filters the results. Most modern DBs use a hybrid, intersecting a filter index with the ANN traversal, which gets the best of both.

	Pre-filter	Post-filter
Sparse filter (1% pass)	Very slow; degenerates to brute force	Fast but might return too few
Dense filter (50% pass)	Fast	Fast

For very selective filters (per-user data), some DBs (Qdrant, pgvector with partial indexes) handle this well; others struggle. Test with realistic filter selectivity.
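
If your DB only offers post-filtering, the "too few results" problem is usually handled by over-fetching. A minimal sketch of that idea, assuming a hypothetical `AnnSearch` callback that returns the nearest hits best first (kept synchronous here for brevity; real clients are async):

```typescript
interface Hit {
  id: string;
  score: number;
  tenantId: string;
}

// Hypothetical ANN call: returns up to `limit` nearest hits, unfiltered.
type AnnSearch = (limit: number) => Hit[];

// Post-filtering with over-fetch: ask for more than K candidates, filter,
// and double the request size while too few survive.
function postFilterSearch(
  ann: AnnSearch,
  keep: (hit: Hit) => boolean,
  k: number,
  maxFetch: number,
): Hit[] {
  let fetch = k;
  for (;;) {
    const candidates = ann(fetch);
    const kept = candidates.filter(keep).slice(0, k);
    const exhausted = candidates.length < fetch; // index has no more candidates
    if (kept.length >= k || exhausted || fetch >= maxFetch) return kept;
    fetch = Math.min(fetch * 2, maxFetch);
  }
}
```

With a 1% filter this loop may need several rounds to fill K results, which is the cost the table above describes. `maxFetch` caps the worst case.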

Multi-Tenancy

Same patterns as Search:

Approach	When
One collection per tenant	Few tenants, big data each; clean isolation
Shared collection + `tenant_id` filter	Many tenants; smaller data each
One DB per tenant	High security; expensive

For SaaS with thousands of tenants: shared collection, tenant_id filter, index that field. For very large tenants (enterprise customers with millions of vectors each): one collection per tenant.
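
The two approaches can coexist behind one routing function. A minimal sketch, where the 1,000,000-vector cutoff, the collection names, and the `Tenant` / `SearchTarget` shapes are all assumptions for illustration:

```typescript
interface Tenant {
  id: string;
  vectorCount: number;
}

interface SearchTarget {
  collection: string;
  filter?: { tenant_id: string };
}

// Assumed cutoff: tenants at or above this size get a dedicated collection.
const DEDICATED_THRESHOLD = 1_000_000;

function targetFor(tenant: Tenant): SearchTarget {
  if (tenant.vectorCount >= DEDICATED_THRESHOLD) {
    // Large tenant: dedicated collection, no tenant filter needed.
    return { collection: `tenant_${tenant.id}` };
  }
  // Everyone else: shared collection, always filtered by tenant_id.
  return { collection: "shared", filter: { tenant_id: tenant.id } };
}
```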

Re-Embedding

You'll need to re-embed when:

Switching embedding models (vendor deprecation, quality upgrade).
Adding new fields to the embedded text.
The embedding model improves and you want the gains.

Plan for this from day one: store the embedding model's name and version with each vector. When re-embedding, build the new index in batches alongside the live one (blue-green pattern):

1. Create new collection `products_v2`
2. Backfill: for each row, generate new embedding, upsert to v2
3. Validate quality on the new collection
4. Cut over reads (config flag)
5. Drop old collection
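
Step 2 of the list above might look like the sketch below. `Collection`, `EmbedFn`, and `backfill` are hypothetical names; a real vector DB client exposes its own upsert call, and the batch size of 100 is an arbitrary default.

```typescript
interface VectorRecord {
  id: string;
  vector: number[];
  model: string; // embedding model name + version, stored with every vector
}

// Hypothetical minimal store interface.
interface Collection {
  upsert(records: VectorRecord[]): Promise<void>;
}

// Hypothetical embedding call: one vector per input text, in order.
type EmbedFn = (texts: string[]) => Promise<number[][]>;

interface Row {
  id: string;
  text: string;
}

// Re-embed rows in batches and upsert them into the new collection.
// Returns the number of records written.
async function backfill(
  rows: Row[],
  embed: EmbedFn,
  target: Collection,
  model: string,
  batchSize = 100,
): Promise<number> {
  let written = 0;
  for (let i = 0; i < rows.length; i += batchSize) {
    const batch = rows.slice(i, i + batchSize);
    const vectors = await embed(batch.map((r) => r.text));
    if (vectors.length !== batch.length) {
      throw new Error(`embed returned ${vectors.length} vectors for ${batch.length} rows`);
    }
    await target.upsert(
      batch.map((r, j) => ({ id: r.id, vector: vectors[j], model })),
    );
    written += batch.length;
  }
  return written;
}
```

Because upserts are keyed by ID, rerunning the backfill after a failure overwrites rather than duplicates, so the job can be restarted safely.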

For embedding generation at scale, use the provider's batch API (50% cheaper at OpenAI, similar with others). It is not real-time, but the savings are substantial for backfills.

Cost Management

Four cost axes:

Axis	Optimization
Embedding generation	Batch API (50% off); shorter chunks; smaller models
Storage	Lower dims (Matryoshka); compress with PQ; archive cold data
Query	Cache common queries; reduce `ef_search`; pre-filter aggressively
Self-host vs SaaS	At >10M vectors, often cheaper to self-host

Real example: 10M docs × 4 chunks each = 40M embeddings. At 1536 dims × 4 bytes per dim, that is ≈ 246 GB of raw vectors. With HNSW index overhead (~1.5×), the working set is ≈ 370 GB, so budget for ~400 GB.

Pinecone Serverless: ~$500-1500/month at this size.
Qdrant self-host on a 256 GB RAM machine (the full working set does not fit in RAM, so this relies on quantization or on-disk vectors): ~$200-400/month compute + ops time.
pgvector on managed Postgres with a beefy instance: similar.

Costs scale roughly linearly with vector count × dimensions. Pick the model dim before you embed everything.
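
Generation cost for the 40M-embedding example can be estimated the same way. In this sketch the 750 tokens per chunk is an assumption (your chunker decides the real number), the prices are the per-million-token figures from the model table, and `embeddingCostUsd` is a hypothetical helper.

```typescript
// USD cost of embedding `tokens` tokens at a per-million-token price,
// with an optional fractional discount (0.5 = 50% off via a batch API).
function embeddingCostUsd(
  tokens: number,
  pricePerMillionUsd: number,
  discount = 0,
): number {
  return (tokens / 1_000_000) * pricePerMillionUsd * (1 - discount);
}

const chunks = 40_000_000;
const tokensPerChunk = 750; // assumption
const totalTokens = chunks * tokensPerChunk; // 30 billion tokens

const smallRealtime = embeddingCostUsd(totalTokens, 0.02);   // about $600
const smallBatch = embeddingCostUsd(totalTokens, 0.02, 0.5); // about $300
const largeBatch = embeddingCostUsd(totalTokens, 0.13, 0.5); // about $1,950
```

Under these assumptions the one-off embedding bill is small next to recurring storage and query costs, which is why dimensionality tends to dominate the decision.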

Observability

Signal	Why
Query latency p99	Especially with filter selectivity changes
Recall on a held-out eval set	Are results getting worse?
Embedding API latency	Often the bottleneck of the whole RAG loop
Embedding API errors / rate limits	Plan retry / fallback
Index build time	When you add many docs at once
Disk / memory usage	Vector DBs are storage-bound; alert before full
Top-K result similarity scores	Distribution drift means model or content changed
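
If your metrics stack does not compute percentiles for you, p99 over a window of latency samples is a few lines. A minimal sketch using the nearest-rank definition; `percentile` is a hypothetical helper name.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}
```

Call it as `percentile(latenciesMs, 99)` per time window, ideally bucketed by filter selectivity so a slow sparse-filter path is not averaged away.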

For RAG specifically, build an evaluation harness — a set of queries with expected answers. Run weekly. Score with RAGAS, LangSmith, or a custom rubric. Regressions are silent without it.

Stale Embeddings

If you embed user-generated content, you have a lifecycle issue: edits, deletions, re-classifications.

On edit: re-embed and upsert.
On delete: delete from the index (and your DB).
On schema change (new fields in embedded text): re-embed.

The flow:

App write to DB ──► CDC / outbox ──► embedding worker ──► upsert to vector DB

Same pattern as keeping any derived index in sync with a source-of-truth DB. See Background Jobs for the queue layer.
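
Inside the embedding worker, it usually pays to collapse a batch of change events before calling the embedding API. The sketch below assumes a hypothetical `ChangeEvent` shape delivered in order by the CDC stream or outbox; `coalesce` is an illustrative name.

```typescript
// Hypothetical change events emitted by a CDC stream or outbox table.
type ChangeEvent =
  | { kind: "upsert"; id: string; text: string }
  | { kind: "delete"; id: string };

interface IndexWork {
  toEmbed: { id: string; text: string }[];
  toDelete: string[];
}

// Keep only the last event per ID, so a document edited three times is
// embedded once, and an edit followed by a delete results in a delete only.
function coalesce(events: ChangeEvent[]): IndexWork {
  const latest = new Map<string, ChangeEvent>();
  for (const e of events) latest.set(e.id, e); // later events overwrite earlier
  const work: IndexWork = { toEmbed: [], toDelete: [] };
  latest.forEach((e) => {
    if (e.kind === "upsert") work.toEmbed.push({ id: e.id, text: e.text });
    else work.toDelete.push(e.id);
  });
  return work;
}
```

The worker then embeds `toEmbed` in one batch, upserts the results, and deletes the IDs in `toDelete`.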

Security

API keys for embedding models in secret manager — see Secrets.
Per-tenant filters at the query layer — never trust client-supplied tenant_id.
Embeddings can leak content — though hard to reverse, they encode source information. Treat them as sensitive data.
Don't embed PII you don't need in the search context. Embed identifiers; resolve to PII per-query in your app.
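
The tenant-filter rule above can be enforced in one place. A minimal sketch, assuming a hypothetical `Session` derived from the verified auth token and a generic key-value filter shape; real vector DB clients have their own filter syntax.

```typescript
// Hypothetical: built server-side from the verified auth token.
interface Session {
  tenantId: string;
}

interface ClientRequest {
  query: string;
  filter?: Record<string, unknown>;
}

interface VectorQuery {
  query: string;
  filter: Record<string, unknown>;
}

function buildQuery(session: Session, req: ClientRequest): VectorQuery {
  return {
    query: req.query,
    // Spread the client filter first so the server-side tenant_id always
    // overwrites anything the client supplied under that key.
    filter: { ...(req.filter ?? {}), tenant_id: session.tenantId },
  };
}
```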

Common Pitfalls

Pitfall	Symptom	Fix
Different embedding models for store vs query	All scores near zero	Same model both sides
Forgetting to normalize	Dot product scores look wrong	Use cosine metric, or pre-normalize
Embedding raw HTML / markdown	Tokens wasted on `<div>` etc	Strip / clean text before embedding
One vector per giant document	Poor retrieval	Chunk; ~500-1000 tokens per chunk
Filter on un-indexed field	Slow queries	Index filter fields
Vector-only when keyword would win	Bad results for exact-match queries	Hybrid search
No eval set	Silent quality regressions	Maintain a labeled eval set
Embedding everything ad-hoc	Costs balloon, no consistency	Centralize embedding logic in one service
Letting the LLM make up citations	Hallucinated answers	Validate that cited docs exist in the retrieved set
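
The citation check in the last row can be a simple set lookup. This sketch assumes the prompt asks the LLM to cite sources as `[doc:ID]` markers; that marker format and both function names are assumptions for illustration.

```typescript
// Extract cited doc IDs from an answer, assuming markers like [doc:abc-123].
function extractCitations(answer: string): string[] {
  const ids: string[] = [];
  const pattern = /\[doc:([A-Za-z0-9_-]+)\]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    ids.push(match[1]);
  }
  return ids;
}

// Return the cited IDs that do not appear among the retrieved documents.
function unknownCitations(citedIds: string[], retrievedIds: string[]): string[] {
  const retrieved = new Set(retrievedIds);
  return citedIds.filter((id) => !retrieved.has(id));
}
```

A non-empty result from `unknownCitations` means the answer cites something that was never retrieved; reject or regenerate it rather than showing it to the user.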

Checklist
