Best Practices
Production vector databases - embedding models, dimensions, indexes, metadata, cost, observability
Vector databases are deceptively simple — store vectors, get nearest neighbors. The hard parts are choosing the right embedding model, sizing indexes, managing cost, and keeping the system fresh as you re-embed.
Choosing an Embedding Model
The single most-impactful choice. Criteria:
| Concern | What matters |
|---|---|
| Quality | Benchmark on your data, not generic benchmarks |
| Dimensionality | Storage and compute scale with dims; 768-1536 is the sweet spot |
| Cost | Per-token pricing; you'll embed millions of tokens |
| Latency | Self-hosted GPU vs SaaS API |
| Multilinguality | Cross-language? Pick a multilingual model |
| Domain fit | Code? Legal docs? Some models are domain-specialized |
| Stability | Vendor deprecating models = re-embed everything |
Models in 2026
| Model | Dims | Cost | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | $0.02/M tokens | Default; cheap; good |
| OpenAI text-embedding-3-large | 3072 (or shorter) | $0.13/M tokens | Higher quality; pricier |
| Voyage voyage-3 | 1024 | $0.06/M tokens | Quality-leading for retrieval |
| Cohere embed-v3 | 1024 | $0.10/M tokens | Multilingual; reranker pair |
| BAAI bge-large-en-v1.5 | 1024 | Self-host | Open-source; strong baseline |
| nomic-embed-text-v1.5 | 64-768 (Matryoshka) | Self-host | Configurable dims |
Run a small eval: pick 100 representative queries with known relevant docs, compare top-K recall across 2-3 models. Quality differences are real and surprising.
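The small eval described above can be sketched as follows. This is a minimal illustration, not a full harness: the `LabeledQuery` shape and the `search` callback (standing in for "embed the query with the candidate model and query the index") are hypothetical names introduced here.

```typescript
// Hypothetical shape: one query plus the doc IDs a human marked as relevant.
interface LabeledQuery {
  query: string;
  relevantIds: string[];
}

// Fraction of the relevant docs that appear in the top-k retrieved results.
function recallAtK(retrievedIds: string[], relevantIds: string[], k: number): number {
  if (relevantIds.length === 0) return 0;
  const topK = new Set(retrievedIds.slice(0, k));
  const hits = relevantIds.filter((id) => topK.has(id)).length;
  return hits / relevantIds.length;
}

// Mean recall@k over the eval set for one model.
// `search` is an assumption: it returns ranked doc IDs for a query.
function meanRecallAtK(
  evalSet: LabeledQuery[],
  search: (query: string) => string[],
  k: number,
): number {
  if (evalSet.length === 0) return 0;
  const total = evalSet.reduce(
    (sum, q) => sum + recallAtK(search(q.query), q.relevantIds, k),
    0,
  );
  return total / evalSet.length;
}
```

Running `meanRecallAtK` once per candidate model, with the same eval set and the same `k`, gives a directly comparable number for each.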
Matryoshka Embeddings
Newer models (OpenAI v3, Nomic) produce Matryoshka embeddings — you can truncate to fewer dimensions with minimal quality loss:
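```typescript
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// OpenAI: request a 256-dim version of a 1536-dim embedding.
// `text` is the string to embed.
const emb = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: text,
  dimensions: 256, // shortened server-side
});
```

A 256-dim vector is 6× cheaper to store and faster to query than a 1536-dim one, with minor quality loss. Worth testing if cost matters.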
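When a model is self-hosted, or when full-size vectors are already stored, the shortening can also be done client-side. The sketch below keeps the first `dims` components and re-normalizes to unit length so cosine and dot-product scores stay comparable. It assumes the model was trained with Matryoshka-style objectives; truncating an ordinary embedding this way generally hurts quality. The function names are illustrative.

```typescript
// Euclidean (L2) length of a vector.
function l2Norm(v: number[]): number {
  return Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
}

// Keep the first `dims` components, then rescale to unit length.
// Assumption: the embedding comes from a Matryoshka-trained model.
function truncateEmbedding(embedding: number[], dims: number): number[] {
  if (!Number.isInteger(dims) || dims <= 0 || dims > embedding.length) {
    throw new RangeError(`dims must be an integer in 1..${embedding.length}`);
  }
  const head = embedding.slice(0, dims);
  const norm = l2Norm(head);
  // An all-zero prefix cannot be normalized; return it unchanged.
  return norm === 0 ? head : head.map((x) => x / norm);
}
```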
Dimensionality Trade-offs
| Dims | Storage / vector | Query speed | Quality (typical) |
|---|---|---|---|
| 256 | 1 KB | Fastest | Good baseline (with Matryoshka models) |
| 768 | 3 KB | Fast | Strong |
| 1536 | 6 KB | Medium | Strong |
| 3072 | 12 KB | Slower | Slightly better |
Storage matters at scale. 100M vectors × 1536 dims × 4 bytes ≈ 600 GB before indexes. Plan accordingly.
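The arithmetic behind the table and the 600 GB figure is simple enough to keep as a helper. This sketch assumes float32 components (4 bytes each) and decimal gigabytes; both are assumptions, and quantized storage would change `bytesPerDim`.

```typescript
const BYTES_PER_GB = 1e9; // decimal GB, as used in the estimate above

// Raw vector storage before any index overhead.
// Assumption: float32 components unless bytesPerDim says otherwise.
function rawVectorBytes(vectorCount: number, dims: number, bytesPerDim = 4): number {
  return vectorCount * dims * bytesPerDim;
}

function toGB(bytes: number): number {
  return bytes / BYTES_PER_GB;
}
```

For example, `toGB(rawVectorBytes(100_000_000, 1536))` evaluates to 614.4, which the text above rounds to 600 GB.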
Index Types
| Index | Speed | Recall | Build time | Memory |
|---|---|---|---|---|
| HNSW (Hierarchical Navigable Small World) | Fast | High | Slow | High |
| IVF (Inverted File) | Fast (after training) | Medium | Fast | Low |
| IVF + PQ (Product Quantization) | Fast | Lower | Medium | Very low |
| DiskANN / Vamana | Fast | High | Slow | Very low (disk-resident) |
| Flat (brute force) | Slow | Perfect | None | High |
HNSW is the default for most modern vector DBs — fastest queries, good recall. Build time and memory are the cost.
For >100M vectors, you'll need disk-based options (DiskANN, IVF+PQ on disk) — pure-in-memory HNSW gets expensive.
Tuning knobs:
- M (HNSW) — graph connectivity; higher = better recall, more memory. Default 16-32.
- ef_construction — quality of build; higher = slower build, better recall. Default 100-200.
- ef_search — quality of search; higher = slower query, better recall. Tune per query.
Most teams accept the defaults; revisit when you have measurable quality / latency / cost issues.
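When you do revisit `ef_search`, one workable approach is to sweep a few values against your eval set and keep the cheapest one that still meets a recall target. The sketch below shows only the selection step; the `SweepPoint` shape is hypothetical, and the measurements would come from your own benchmark runs.

```typescript
// One measured configuration from a benchmark sweep (hypothetical shape).
interface SweepPoint {
  efSearch: number;
  recall: number; // recall on the eval set, 0..1
  p99Ms: number;  // observed p99 query latency in milliseconds
}

// Lowest-latency point that still meets the recall target,
// or undefined if no point qualifies.
function pickEfSearch(sweep: SweepPoint[], minRecall: number): SweepPoint | undefined {
  return sweep
    .filter((p) => p.recall >= minRecall) // filter() copies, so the input is not reordered
    .sort((a, b) => a.p99Ms - b.p99Ms)[0];
}
```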
Filter Performance
Pre-filtering (apply filter, then ANN) vs post-filtering (ANN, then filter):
| | Pre-filter | Post-filter |
|---|---|---|
| Sparse filter (1% pass) | Very slow — degenerates to brute force | Fast but might return too few |
| Dense filter (50% pass) | Fast | Fast |
| Most modern DBs | Hybrid: filter index intersected with ANN | Best of both |
For very selective filters (per-user data), some DBs (Qdrant, pgvector with partial indexes) handle this well; others struggle. Test with realistic filter selectivity.
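The "might return too few" failure mode of post-filtering is easy to reproduce on a toy dataset. In the sketch below, `topK` is an exact search standing in for an ANN index, and the `Doc` shape is hypothetical; real databases implement filtered search very differently, so treat this purely as an illustration of the ordering of the two steps.

```typescript
interface Doc {
  id: string;
  tenantId: string;
  vector: number[];
}

function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Stand-in for an ANN index: exact top-k by dot product.
function topK(docs: Doc[], query: number[], k: number): Doc[] {
  return [...docs]
    .sort((x, y) => dot(y.vector, query) - dot(x.vector, query))
    .slice(0, k);
}

// Post-filter: search first, then drop non-matching results.
function postFilter(docs: Doc[], query: number[], k: number, keep: (d: Doc) => boolean): Doc[] {
  return topK(docs, query, k).filter(keep);
}

// Pre-filter: restrict the candidate set first, then search it.
function preFilter(docs: Doc[], query: number[], k: number, keep: (d: Doc) => boolean): Doc[] {
  return topK(docs.filter(keep), query, k);
}
```

With a selective filter, `postFilter` can return fewer than `k` results (possibly none) even though matching documents exist, while `preFilter` finds them at the cost of scanning the filtered subset.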
Multi-Tenancy
Same patterns as Search:
| Approach | When |
|---|---|
| One collection per tenant | Few tenants, big data each; clean isolation |
| Shared collection + tenant_id filter | Many tenants; smaller data each |
| One DB per tenant | High security; expensive |
For SaaS with thousands of tenants: shared collection, tenant_id filter, index that field. For very large tenants (enterprise customers with millions of vectors each): one collection per tenant.
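The mixed strategy above (shared collection by default, dedicated collection for very large tenants) can be expressed as a small routing function. The threshold, the collection naming scheme, and the `Tenant` and `Route` shapes are all hypothetical; pick a cutoff based on your own measurements.

```typescript
interface Tenant {
  id: string;
  vectorCount: number;
}

interface Route {
  collection: string;
  filter?: { tenant_id: string };
}

// Hypothetical cutoff: tenants at or above this size get their own collection.
const DEDICATED_THRESHOLD = 1_000_000;

function routeFor(tenant: Tenant, threshold: number = DEDICATED_THRESHOLD): Route {
  if (tenant.vectorCount >= threshold) {
    // Large tenant: isolated collection, no tenant filter needed.
    return { collection: `tenant_${tenant.id}` };
  }
  // Everyone else: shared collection, always scoped by tenant_id.
  return { collection: "shared", filter: { tenant_id: tenant.id } };
}
```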
Re-Embedding
You'll need to re-embed when:
- Switching embedding models (vendor deprecation, quality upgrade).
- Adding new fields to the embedded text.
- The embedding model improves and you want the gains.
Plan for this from day one: store the embedding's model name + version with each vector. When re-embedding, do it in batches alongside the live index (blue-green pattern):
1. Create new collection `products_v2`
2. Backfill: for each row, generate new embedding, upsert to v2
3. Validate quality on the new collection
4. Cut over reads (config flag)
5. Drop old collection

For embedding generation at scale, use the provider's batch API (50% cheaper at OpenAI, similar with others). It is not real-time, but the savings are significant for backfills.
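Storing the model name and version with each vector makes the backfill step resumable: at any point you can ask which rows still carry the old model and process only those, in batches. A minimal sketch of that bookkeeping follows; the `StoredVector` and `TargetModel` shapes are hypothetical, and the actual embedding and upsert calls are left out.

```typescript
interface StoredVector {
  id: string;
  model: string;
  modelVersion: string;
}

interface TargetModel {
  model: string;
  modelVersion: string;
}

// Rows whose stored model metadata does not match the target.
function needsReembed(rows: StoredVector[], target: TargetModel): StoredVector[] {
  return rows.filter(
    (r) => r.model !== target.model || r.modelVersion !== target.modelVersion,
  );
}

// Split work into fixed-size batches for the embedding API or a batch job.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

The same check is useful during validation: an empty result from `needsReembed` on the new collection is a cheap signal that the backfill is complete.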
Cost Management
Four cost axes:
| Axis | Optimization |
|---|---|
| Embedding generation | Batch API (50% off); shorter chunks; smaller models |
| Storage | Lower dims (Matryoshka); compress with PQ; archive cold data |
| Query | Cache common queries; reduce ef_search; pre-filter aggressively |
| Self-host vs SaaS | At >10M vectors, often cheaper to self-host |
Real example: 10M docs × 4 chunks each = 40M embeddings. At 1536 dims × 4 bytes each, that is ≈ 246 GB of raw vectors. With HNSW index overhead (~1.5×), the working set is ≈ 370 GB; budget for roughly 400 GB.
- Pinecone Serverless: ~$500-1500/month at this size.
- Qdrant self-host on a 256 GB RAM machine: ~$200-400/month compute + ops time.
- pgvector on managed Postgres with `pgvector` plus a beefy instance: similar.
Costs scale roughly linearly with vector count × dimensions. Pick the model dim before you embed everything.
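Because the costs are close to linear, a back-of-envelope calculator is usually enough to compare options before committing to a model. The sketch below covers the working-set estimate from the example and the one-off embedding bill. The 1.5× index overhead and the batch discount come from the text above; the function names and any tokens-per-chunk figure you plug in are assumptions.

```typescript
// Approximate in-memory working set: raw float32 vectors times index overhead.
function workingSetBytes(
  docs: number,
  chunksPerDoc: number,
  dims: number,
  indexOverhead = 1.5, // rough HNSW multiplier, as in the example above
  bytesPerDim = 4,     // float32
): number {
  return docs * chunksPerDoc * dims * bytesPerDim * indexOverhead;
}

// One-off cost of embedding a corpus.
// batchDiscount is a fraction, e.g. 0.5 for a 50% batch-API discount.
function embeddingCostUsd(
  totalTokens: number,
  usdPerMillionTokens: number,
  batchDiscount = 0,
): number {
  return (totalTokens / 1_000_000) * usdPerMillionTokens * (1 - batchDiscount);
}
```

As an illustration only: if the 40M chunks averaged 500 tokens each (an assumed figure), that is 20B tokens, and `embeddingCostUsd(20_000_000_000, 0.02, 0.5)` comes to about $200 at the batch rate.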
Observability
| Signal | Why |
|---|---|
| Query latency p99 | Especially with filter selectivity changes |
| Recall on a held-out eval set | Are results getting worse? |
| Embedding API latency | Often the bottleneck of the whole RAG loop |
| Embedding API errors / rate limits | Plan retry / fallback |
| Index build time | When you add many docs at once |
| Disk / memory usage | Vector DBs are storage-bound; alert before full |
| Top-K result similarity scores | Distribution drift means model or content changed |
For RAG specifically, build an evaluation harness — a set of queries with expected answers. Run weekly. Score with RAGAS, LangSmith, or a custom rubric. Regressions are silent without it.
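Two of the signals in the table, p99 latency and similarity-score drift, need nothing more than a few lines of arithmetic over logged samples. This is a sketch using the nearest-rank percentile definition; the function names are illustrative, and the drift threshold to alert on is something you would calibrate yourself.

```typescript
// Nearest-rank percentile: smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

function mean(values: number[]): number {
  if (values.length === 0) throw new Error("no samples");
  return values.reduce((sum, x) => sum + x, 0) / values.length;
}

// Change in mean top-K similarity between a baseline window and the
// current window. A sustained shift suggests the model or content changed.
function scoreDrift(baselineScores: number[], currentScores: number[]): number {
  return mean(currentScores) - mean(baselineScores);
}
```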
Stale Embeddings
If you embed user-generated content, you have a lifecycle issue: edits, deletions, re-classifications.
- On edit: re-embed and upsert.
- On delete: delete from the index (and your DB).
- On schema change (new fields in embedded text): re-embed.
The flow:
```text
App write to DB ──► CDC / outbox ──► embedding worker ──► upsert to vector DB
```

Same pattern as keeping any derived index in sync with a source-of-truth DB. See Background Jobs for the queue layer.
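The embedding worker's decision logic can be kept separate from the I/O. The sketch below maps change events to index actions and collapses multiple events for the same record so that only the latest state is applied. The event and action shapes are hypothetical; a real CDC or outbox payload will look different, and the embed and upsert calls are omitted.

```typescript
// Hypothetical change events coming off the outbox.
type ChangeEvent =
  | { kind: "created" | "edited"; id: string; text: string }
  | { kind: "deleted"; id: string };

// What the worker should do to the vector index.
type IndexAction =
  | { op: "upsert"; id: string; text: string }
  | { op: "delete"; id: string };

function toIndexAction(event: ChangeEvent): IndexAction {
  switch (event.kind) {
    case "created":
    case "edited":
      // Re-embed the current text and upsert under the same ID.
      return { op: "upsert", id: event.id, text: event.text };
    case "deleted":
      return { op: "delete", id: event.id };
  }
}

// Last event per ID wins, so a record edited three times in one batch
// is embedded once. Assumes events arrive in commit order.
function coalesce(events: ChangeEvent[]): IndexAction[] {
  const latest = new Map<string, IndexAction>();
  for (const event of events) {
    latest.set(event.id, toIndexAction(event));
  }
  return Array.from(latest.values());
}
```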
Security
- API keys for embedding models in secret manager — see Secrets.
- Per-tenant filters at the query layer: never trust a client-supplied `tenant_id`.
- Embeddings can leak content: though hard to reverse, they encode source information. Treat them as sensitive data.
- Don't embed PII you don't need in the search context. Embed identifiers; resolve to PII per-query in your app.
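Enforcing the tenant filter server-side mostly comes down to where the `tenant_id` value originates. In the sketch below it is taken from the authenticated session and written last, so anything the client sent under that key is overwritten. The `Session` and `Filter` shapes are hypothetical, and the flat key/value filter is a simplification of what real vector DB filter syntaxes look like.

```typescript
// Hypothetical authenticated session, populated by your auth middleware.
interface Session {
  tenantId: string;
}

// Simplified flat filter; real filter DSLs are richer.
type Filter = Record<string, string | number | boolean>;

// Merge the client's filter with a server-controlled tenant scope.
// The session's tenant_id is applied last and always wins.
function buildFilter(session: Session, clientFilter: Filter = {}): Filter {
  return { ...clientFilter, tenant_id: session.tenantId };
}
```

A request that tries `{ tenant_id: "someone-else" }` still ends up scoped to the caller's own tenant.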
Common Pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| Different embedding models for store vs query | All scores near zero | Same model both sides |
| Forgetting to normalize | Dot product scores look wrong | Use cosine metric, or pre-normalize |
| Embedding raw HTML / markdown | Tokens wasted on `<div>` etc | Strip / clean text before embedding |
| One vector per giant document | Poor retrieval | Chunk; ~500-1000 tokens per chunk |
| Filter on un-indexed field | Slow queries | Index filter fields |
| Vector-only when keyword would win | Bad results for exact-match queries | Hybrid search |
| No eval set | Silent quality regressions | Maintain a labeled eval set |
| Embedding everything ad-hoc | Costs balloon, no consistency | Centralize embedding logic in one service |
| Letting LLM make up citations | Hallucinated answers | Validate citations exist in retrieved docs |
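For the "one vector per giant document" pitfall, a fixed-size window with overlap is the simplest chunking strategy to start from. The sketch below works on an already-tokenized array, so it is independent of any particular tokenizer; the function name is illustrative, and splitting on semantic boundaries (headings, paragraphs) is often better when the content has them.

```typescript
// Split a token sequence into windows of `size` tokens, where each window
// repeats the last `overlap` tokens of the previous one.
function chunkTokens<T>(tokens: T[], size: number, overlap: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  if (!Number.isInteger(overlap) || overlap < 0 || overlap >= size) {
    throw new RangeError("overlap must be an integer in 0..size-1");
  }
  const step = size - overlap;
  const chunks: T[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size));
    // Stop once a window has reached the end, to avoid a trailing
    // chunk made up entirely of overlap.
    if (start + size >= tokens.length) break;
  }
  return chunks;
}
```

With the table's guidance, `size` would be somewhere in the 500-1000 range; the overlap is a tuning choice to validate against your eval set.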
Checklist
Production vector database checklist
- Embedding model chosen via eval, not just defaults
- Same model used for store and query
- Model name + version stored with vectors (for future re-embedding)
- Chunking strategy appropriate to content (size, overlap, semantic boundaries)
- Cosine distance (or dot product with pre-normalized vectors)
- Index type chosen (HNSW for most; DiskANN for hyperscale)
- Metadata filters use indexed fields
- Hybrid search (vector + keyword) for retrieval, not pure vector
- Reranking for high-quality results
- Per-tenant filters enforced server-side
- Eval set + regression testing for RAG quality
- Embedding API failures handled with retries / fallback
- Re-embedding plan for model upgrades (blue-green)
- Storage and query latency monitored
- Cost monitoring (embedding tokens, vector storage)
- Embeddings treated as sensitive data
- Citations validated; LLM not trusted to make them up