Hybrid Search
Combining keyword and vector search, filters, chunking, reranking - what real RAG looks like
Vector search alone is rarely the answer. The state of the art for RAG and search is hybrid — combine keyword (BM25) and vector results, optionally rerank. This page covers how.
The Problem with Pure Vector
| Keyword catches | Vectors miss |
|---|---|
| Exact product names ("iPhone 16 Pro Max") | Names embed similarly to other phones |
| Codes ("ERR_404", "SKU-12345") | Numeric strings barely contribute to embeddings |
| Rare technical terms | Out-of-distribution words for the embedding model |
| Boolean / filter operations | Soft semantic matching only |
| Vectors catch | Keyword misses |
|---|---|
| "Couch" ↔ "sofa" | Different words → no overlap |
| Paraphrasing | "I want to cancel my subscription" vs "how do I unsubscribe" |
| Multi-language | Cross-language similarity (with multilingual model) |
| Conceptual proximity | "coffee maker" ↔ "espresso machine" |
Neither alone is enough. The combination beats either.
The Hybrid Pattern
Query
│
├──► Keyword search (BM25) ──► top-K results + scores
│
├──► Embed query ──► Vector search ──► top-K results + scores
│
└──► Combine (RRF or weighted) ──► top-K hybrid results
│
└──► Optional: rerank with a cross-encoder ──► final top-N
Both retrievers return overlapping (but not identical) candidate sets. Combining is the trick.
Reciprocal Rank Fusion (RRF)
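The simplest combination. For each document, sum 1 / (k + rank) across retrievers, where rank is the document's 1-based position in each list and k is a smoothing constant (60 is the conventional default):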
function rrf(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    // forEach passes a 0-based index, so rank + 1 is the 1-based rank.
    list.forEach((doc, rank) => {
      scores.set(doc.id, (scores.get(doc.id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort(([, a], [, b]) => b - a)
    .map(([id, score]) => ({ id, score }));
}
// Usage (pseudocode: both calls are assumed to resolve to arrays of { id, ... }
// ordered best-first, with ids of the same type in both lists)
const keywordResults = await typesense.search(query); // [{id: '7', ...}, {id: '12', ...}]
const vectorResults = await qdrant.search('products', { vector: emb });
const merged = rrf([keywordResults, vectorResults]);
RRF doesn't need scores to be on comparable scales because it only uses ranks. That makes it a robust default for merging very different retrievers.
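To make the arithmetic concrete, here is a small worked sketch of the same fusion on two hand-written lists. It adds one thing the snippet above does not do: ids are coerced with String(), on the assumption that the two stores may return the same document id as different types (for example '7' from one and 7 from the other). The sample ids are made up for illustration.

```javascript
// RRF over ranked lists, keyed by String(id) so '7' and 7 count as one document.
function rrf(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((doc, index) => {
      const id = String(doc.id);
      // index is 0-based; the formula uses a 1-based rank.
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return [...scores.entries()]
    .sort(([, a], [, b]) => b - a)
    .map(([id, score]) => ({ id, score }));
}

// Illustrative data only: keyword hits use string ids, vector hits use numbers.
const keywordHits = [{ id: '7' }, { id: '12' }, { id: '3' }];
const vectorHits = [{ id: 3 }, { id: 7 }, { id: 40 }];

const fused = rrf([keywordHits, vectorHits]);
console.log(fused.map((r) => r.id)); // [ '7', '3', '12', '40' ]
```

Document 7 scores 1/61 + 1/62 and document 3 scores 1/63 + 1/61, so both outrank documents 12 and 40, which appear in only one list.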
Weighted Combination
If you want to tune the keyword/vector weight:
function weightedFusion(keywordResults, vectorResults, alpha = 0.5) {
  // normalize() is assumed to map a raw score into [0, 1] within its own
  // result list (for example min-max scaling); raw BM25 and cosine scores
  // are not comparable as-is.
  const scores = new Map();
  for (const r of keywordResults) {
    scores.set(r.id, (scores.get(r.id) ?? 0) + alpha * normalize(r.score));
  }
  for (const r of vectorResults) {
    scores.set(r.id, (scores.get(r.id) ?? 0) + (1 - alpha) * normalize(r.score));
  }
  return [...scores.entries()].sort(([, a], [, b]) => b - a);
}
alpha ranges from 0 (pure vector) to 1 (pure keyword). 0.3-0.5 is a common sweet spot.
Tuning alpha is best done with an evaluation set — examples of queries with known "good" results.
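One possible way to set this up is sketched below. The weighted fusion code leaves normalize undefined, so this version assumes per-list min-max scaling, and it assumes recall@k as the metric; both are choices made for illustration, not the only reasonable ones. The evaluation-set shape ({ keyword, vector, relevant }) and all function names are hypothetical.

```javascript
// Min-max scaling: maps one result list's raw scores into [0, 1].
function minMaxNormalize(results) {
  if (results.length === 0) return [];
  const raw = results.map((r) => r.score);
  const min = Math.min(...raw);
  const range = Math.max(...raw) - min;
  // If every score is identical, treat all results as equally strong.
  return results.map((r) => ({ ...r, score: range === 0 ? 1 : (r.score - min) / range }));
}

// Weighted fusion over normalized lists; returns ids ordered best-first.
function fuse(keywordResults, vectorResults, alpha) {
  const scores = new Map();
  for (const r of minMaxNormalize(keywordResults)) {
    scores.set(r.id, (scores.get(r.id) ?? 0) + alpha * r.score);
  }
  for (const r of minMaxNormalize(vectorResults)) {
    scores.set(r.id, (scores.get(r.id) ?? 0) + (1 - alpha) * r.score);
  }
  return [...scores.entries()].sort(([, a], [, b]) => b - a).map(([id]) => id);
}

// Recall@k: fraction of the known-relevant ids found in the top k.
function recallAtK(rankedIds, relevantIds, k) {
  const top = new Set(rankedIds.slice(0, k));
  return relevantIds.filter((id) => top.has(id)).length / relevantIds.length;
}

// Average recall@k over the evaluation set for each candidate alpha.
function sweepAlpha(evalSet, alphas, k) {
  return alphas.map((alpha) => {
    const total = evalSet.reduce(
      (sum, ex) => sum + recallAtK(fuse(ex.keyword, ex.vector, alpha), ex.relevant, k),
      0,
    );
    return { alpha, recall: total / evalSet.length };
  });
}
```

Calling sweepAlpha(evalSet, [0, 0.25, 0.5, 0.75, 1], 10) on a few dozen labelled queries gives one recall figure per alpha, which is usually enough to see whether the data leans keyword-heavy or vector-heavy.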
Native Hybrid Search
Most modern vector DBs ship hybrid out of the box:
Qdrant (server-side fusion)
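const results = await qdrant.query('products', {
  prefetch: [
    {
      // Sparse query against the named vector 'bm25'. querySparse is assumed
      // to be a sparse vector of the form { indices: [...], values: [...] }.
      query: querySparse,
      using: 'bm25',
      limit: 20,
    },
    {
      query: queryEmbedding,
      using: 'dense',
      limit: 20,
    },
  ],
  // Fuse the two prefetch result sets server-side with RRF.
  query: { fusion: 'rrf' },
  limit: 10,
});
Weaviate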
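// Note: in Weaviate, alpha = 1 is pure vector and alpha = 0 is pure keyword,
// the reverse of the weightedFusion convention above.
const results = await client.graphql.get()
  .withClassName('Product')
  .withHybrid({ query: 'coffee maker', alpha: 0.5 })
  .withLimit(10)
  .do();
Elasticsearch / OpenSearch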
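Elasticsearch supports hybrid search natively through its rrf retriever. OpenSearch offers a hybrid query whose sub-query scores are normalized and combined in a search pipeline.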
pgvector
You combine queries yourself:
WITH keyword_results AS (
  SELECT id,
         ts_rank(document, query) AS score,
         row_number() OVER (ORDER BY ts_rank(document, query) DESC) AS rank
  FROM products, plainto_tsquery('coffee maker') query
  WHERE document @@ query
  ORDER BY score DESC  -- without this, LIMIT keeps an arbitrary 20 rows
  LIMIT 20
),
vector_results AS (
  SELECT id,
         1 - (embedding <=> '[0.21, ...]'::vector) AS score,
         row_number() OVER (ORDER BY embedding <=> '[0.21, ...]'::vector) AS rank
  FROM products
  ORDER BY embedding <=> '[0.21, ...]'::vector
  LIMIT 20
)
SELECT id, SUM(1.0 / (60 + rank)) AS score
FROM (
  SELECT id, rank FROM keyword_results
  UNION ALL
  SELECT id, rank FROM vector_results
) u
GROUP BY id
ORDER BY score DESC
LIMIT 10;
A bit more code, but no extra service.
Chunking
Long documents (PDFs, articles, manuals) don't embed well as one vector — too much disparate content compressed into one point. Chunk first.
| Strategy | When |
|---|---|
| Fixed-size chunks (e.g., 500 tokens) | Simple; works OK |
| Sliding window (500-token chunks with 50-token overlap) | Better; queries hitting boundaries still find context |
| Semantic chunking (split at paragraph / section boundaries) | Best; respects content structure |
| Hierarchical (chunk + parent chunk references) | Retrieve small for precision, expand to parent for context |
| One chunk per logical unit (FAQ entry, code function) | Match the unit users actually want |
For most RAG: 500-1000 tokens per chunk, 10-20% overlap, store the parent doc reference. Retrieve the chunk; pass the parent doc as context.
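// Naive chunker. tokenize() is assumed to return an array of token strings.
function chunk(text, size = 500, overlap = 50) {
  if (overlap >= size) throw new Error('overlap must be smaller than size');
  const tokens = tokenize(text);
  const chunks = [];
  for (let i = 0; i < tokens.length; i += size - overlap) {
    chunks.push(tokens.slice(i, i + size).join(' '));
    // Stop once a chunk reaches the end, so no redundant tail chunk is emitted.
    if (i + size >= tokens.length) break;
  }
  return chunks;
}
For real chunking, langchain.text_splitter.RecursiveCharacterTextSplitter or llama_index.SimpleNodeParser do this well.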
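As a rough illustration of the semantic and parent-reference rows in the table, the sketch below splits on blank lines and packs whole paragraphs into chunks, tagging each chunk with the id of its parent document. It assumes whitespace-separated words are an acceptable stand-in for tokens, which a real tokenizer would refine. The function name and the { parentId, index, text } shape are hypothetical.

```javascript
// Pack whole paragraphs into chunks of at most maxWords words each.
// A single paragraph longer than maxWords becomes its own oversized chunk.
function chunkByParagraph(docId, text, maxWords = 500) {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter(Boolean);

  const chunks = [];
  let current = [];
  let wordCount = 0;

  // Emit the paragraphs collected so far as one chunk.
  const flush = () => {
    if (current.length === 0) return;
    chunks.push({ parentId: docId, index: chunks.length, text: current.join('\n\n') });
    current = [];
    wordCount = 0;
  };

  for (const paragraph of paragraphs) {
    const words = paragraph.split(/\s+/).length;
    // Start a new chunk rather than splitting a paragraph in two.
    if (wordCount > 0 && wordCount + words > maxWords) flush();
    current.push(paragraph);
    wordCount += words;
  }
  flush();
  return chunks;
}
```

Storing parentId alongside each embedded chunk is what lets retrieval match on the small chunk and then load the parent document for context.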
Reranking
After hybrid retrieval, a cross-encoder model can re-score by taking the query and each candidate together:
Hybrid retrieval (50 candidates)
│
▼
Cross-encoder: score(query, doc1), score(query, doc2), ...
│
▼
Top 5 by cross-encoder score → pass to LLM
Cross-encoders are slower than bi-encoders (the standard embedding models) but much more accurate, because they look at the query and document together. You can afford to score 50 candidates this way per query.
Popular rerankers:
- Cohere Rerank: SaaS; commercial; high quality
- Voyage AI rerank: SaaS; growing
- bge-reranker-large: open-source; self-host
- mxbai-rerank-large-v1: open-source; faster
import { CohereClient } from 'cohere-ai';
const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });
async function rerank(query, candidates) {
  const result = await cohere.rerank({
    query,
    documents: candidates.map(c => c.text),
    topN: 5,
    model: 'rerank-english-v3.0',
  });
  // Each result carries the index of its document in the request order.
  return result.results.map(r => candidates[r.index]);
}
For latency-tolerant queries, reranking is usually the single biggest quality lift after hybrid retrieval.
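The rerank function above keeps only the candidates and drops the reranker's scores. A possible refinement, sketched here, keeps the score on each candidate and discards anything below a relevance floor, so weak matches never reach the LLM. The { index, relevanceScore } result shape is an assumption about the reranker response, and selectReranked, topN, and minScore are hypothetical names.

```javascript
// Map reranker results back onto the candidates, keeping the score.
// rerankResults: [{ index, relevanceScore }] where index points into candidates.
function selectReranked(candidates, rerankResults, { topN = 5, minScore = 0 } = {}) {
  return rerankResults
    .filter((r) => r.relevanceScore >= minScore) // drop weak matches
    .sort((a, b) => b.relevanceScore - a.relevanceScore) // best first
    .slice(0, topN)
    .map((r) => ({ ...candidates[r.index], rerankScore: r.relevanceScore }));
}
```

The right minScore depends on the reranker model and the corpus, so it is best chosen from the same evaluation set used for tuning retrieval.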
Filters and Metadata
A common pattern: vector search, but only within a user's data:
const results = await qdrant.search('docs', {
  vector: queryEmb,
  filter: {
    must: [
      { key: 'org_id', match: { value: userOrgId } },
      { key: 'created_at', range: { gte: lastWeekTimestamp } },
    ],
  },
  limit: 20,
});
Per-tenant filtering is how you handle multi-tenancy. Index your filter fields so the engine can apply must conditions as part of the ANN search instead of discarding results afterwards, which can leave you with fewer than limit hits.
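Because a forgotten tenant filter leaks data across organizations, one defensive option is to build the filter in a single helper that refuses to run without an org id. The sketch below produces the same filter shape as the query above. buildTenantFilter and its since option are hypothetical names, and the org id is assumed to come from the authenticated session rather than from request input.

```javascript
// Build the filter for every search; throws if the tenant is missing.
function buildTenantFilter(orgId, { since } = {}) {
  if (!orgId) throw new Error('orgId is required for every search');
  const must = [{ key: 'org_id', match: { value: orgId } }];
  // Optional recency constraint, expressed as a lower bound on created_at.
  if (since !== undefined) {
    must.push({ key: 'created_at', range: { gte: since } });
  }
  return { must };
}
```

Passing filter: buildTenantFilter(session.orgId, { since: lastWeekTimestamp }) into the search call then makes the unfiltered path impossible to reach by accident.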
The Full Stack: Production RAG
User question
│
├──► Query rewrite / expansion (optional, LLM-driven)
│
├──► Hybrid retrieval (vector + BM25, top 50)
│
├──► Metadata filter (per-tenant, recency)
│
├──► Rerank (cross-encoder, top 5)
│
├──► Build context with citations
│
└──► LLM with context
│
└──► Answer with [1][2][3] citations
│
└──► Verify citations actually exist (don't trust LLM)
Each step matters, and ablating any one degrades quality measurably. The hardest part isn't the vector DB but everything around it.
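The last step in the diagram can be as small as the sketch below, which assumes the answer cites sources as [1], [2], ... numbered in the order they were placed in the context. It only checks that each cited number maps to a source that was actually supplied, not that the source supports the claim. findInvalidCitations is a hypothetical name.

```javascript
// Return the cited numbers that do not correspond to any supplied source.
function findInvalidCitations(answer, sourceCount) {
  const cited = new Set();
  for (const match of answer.matchAll(/\[(\d+)\]/g)) {
    cited.add(Number(match[1]));
  }
  return [...cited]
    .filter((n) => n < 1 || n > sourceCount)
    .sort((a, b) => a - b);
}

// Three sources were passed to the model, so [4] points at nothing.
console.log(findInvalidCitations('Filter first [1]. Rerank after [2][4].', 3)); // [ 4 ]
```

A non-empty result is a signal to regenerate the answer or strip the unsupported sentence before showing it to the user.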
What's Next
You can do hybrid retrieval at production quality. Best Practices covers operations — embedding model choice, dimensions, indexes, cost, observability.