Combining keyword and vector search, filters, chunking, reranking - what real RAG looks like

Hybrid Search

Vector search alone is rarely the answer. The state of the art for RAG and search is hybrid — combine keyword (BM25) and vector results, optionally rerank. This page covers how.

The Problem with Pure Vector

Keyword catches	Vectors miss
Exact product names ("iPhone 16 Pro Max")	Names embed similarly to other phones
Codes ("ERR_404", "SKU-12345")	Numeric strings barely contribute to embeddings
Rare technical terms	Out-of-distribution words for the embedding model
Boolean / Filter operations	Soft semantic matching only

Vectors catch	Keyword misses
"Couch" ↔ "sofa"	Different words → no overlap
Paraphrasing	"I want to cancel my subscription" vs "how do I unsubscribe"
Multi-language	Cross-language similarity (with multilingual model)
Conceptual proximity	"coffee maker" ↔ "espresso machine"

Neither alone is enough. The combination beats either.

The Hybrid Pattern

Query
  │
  ├──► Keyword search (BM25)  ──► top-K results + scores
  │
  ├──► Embed query ──► Vector search ──► top-K results + scores
  │
  └──► Combine (RRF or weighted) ──► top-K hybrid results
  │
  └──► Optional: rerank with a cross-encoder ──► final top-N

Both retrievers return overlapping (but not identical) candidate sets. Combining is the trick.
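The flow above can be sketched as one small orchestration function. This is a minimal sketch under stated assumptions, not a definitive implementation: `keywordSearch`, `vectorSearch`, and `fuse` are hypothetical functions you supply. Each retriever is assumed to return an array of `{ id }` objects, best match first.

```javascript
// Sketch of the hybrid pattern. keywordSearch, vectorSearch, and fuse are
// hypothetical caller-supplied functions, not part of any library.
async function hybridSearch(query, { keywordSearch, vectorSearch, fuse, topK = 10 }) {
  // The two retrievers are independent, so run them concurrently.
  const [keywordResults, vectorResults] = await Promise.all([
    keywordSearch(query),
    vectorSearch(query),
  ]);

  // fuse receives the ranked lists and returns one merged, ranked list.
  return fuse([keywordResults, vectorResults]).slice(0, topK);
}
```

Running the retrievers in parallel keeps hybrid latency close to that of the slower retriever rather than the sum of both.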

Reciprocal Rank Fusion (RRF)

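The simplest combination. For each document, sum 1 / (k + rank) across retrievers, where rank is the document's 1-based position in each list and k is a smoothing constant (60 is the conventional default). The code uses rank + 1 because array indexes start at 0: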

function rrf(rankedLists, k = 60) {
  const scores = new Map();

  for (const list of rankedLists) {
    list.forEach((doc, rank) => {
      scores.set(doc.id, (scores.get(doc.id) ?? 0) + 1 / (k + rank + 1));
    });
  }

  return [...scores.entries()]
    .sort(([, a], [, b]) => b - a)
    .map(([id, score]) => ({ id, score }));
}

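// Usage: each list is an array of { id } objects, best match first.
// Assumes the official Typesense and Qdrant JS clients, a 'name' field to query, and emb = the query embedding.
const kw = await typesense.collections('products').documents().search({ q: query, query_by: 'name' });
const keywordResults = kw.hits.map(h => ({ id: String(h.document.id) }));   // [{id: '7'}, {id: '12'}, ...]
const vec = await qdrant.search('products', { vector: emb, limit: 50 });
const vectorResults = vec.map(p => ({ id: String(p.id) }));   // stringify so ids match across both lists
const merged = rrf([keywordResults, vectorResults]);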

RRF doesn't need scores to be on comparable scales; it only uses ranks. That makes it the most robust way to merge results from very different retrievers.
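To see how ranks alone decide the order, here's a worked example on two toy lists. The `rrf` function is repeated so the snippet runs on its own, and the document ids are made up for illustration.

```javascript
function rrf(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((doc, rank) => {
      scores.set(doc.id, (scores.get(doc.id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort(([, a], [, b]) => b - a)
    .map(([id, score]) => ({ id, score }));
}

// Toy ranked lists (best match first); ids are illustrative.
const keywordList = [{ id: 'A' }, { id: 'B' }, { id: 'C' }];
const vectorList  = [{ id: 'C' }, { id: 'A' }, { id: 'D' }];

const fused = rrf([keywordList, vectorList]);
// A: 1/61 + 1/62 ≈ 0.03252   (rank 1 in keyword, rank 2 in vector)
// C: 1/63 + 1/61 ≈ 0.03227   (rank 3 in keyword, rank 1 in vector)
// B: 1/62        ≈ 0.01613   (keyword only)
// D: 1/63        ≈ 0.01587   (vector only)
console.log(fused.map(d => d.id)); // [ 'A', 'C', 'B', 'D' ]
```

Documents found by both retrievers land on top, even though neither list's raw scores were ever compared.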

Weighted Combination

If you want to tune the keyword/vector weight:

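// Min-max normalize one result list's scores into [0, 1] so the two retrievers
// are comparable. Assumes higher score = better match for both lists.
function normalizeScores(results) {
  const scores = results.map(r => r.score);
  const min = Math.min(...scores);
  const range = Math.max(...scores) - min;
  return results.map(r => ({ id: r.id, score: range === 0 ? 1 : (r.score - min) / range }));
}

function weightedFusion(keywordResults, vectorResults, alpha = 0.5) {
  const scores = new Map();

  for (const r of normalizeScores(keywordResults)) {
    scores.set(r.id, (scores.get(r.id) ?? 0) + alpha * r.score);
  }
  for (const r of normalizeScores(vectorResults)) {
    scores.set(r.id, (scores.get(r.id) ?? 0) + (1 - alpha) * r.score);
  }

  return [...scores.entries()]
    .sort(([, a], [, b]) => b - a)
    .map(([id, score]) => ({ id, score }));
}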

alpha ranges from 0 (pure vector) to 1 (pure keyword). Values of 0.3-0.5 are a common starting point.

Tuning alpha is best done with an evaluation set — examples of queries with known "good" results.
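A minimal sketch of that tuning loop, assuming recall@k as the metric. The `evalSet` shape and the `search(query, alpha)` callback are hypothetical: `search` stands for your own hybrid pipeline run with a given alpha, returning a ranked array of `{ id }` objects.

```javascript
// Evaluation-set sketch. Both inputs are hypothetical shapes:
//   evalSet: [{ query, relevantIds: ['7', '12'] }, ...]  (must be non-empty)
//   search(query, alpha): async, returns ranked [{ id }, ...] for that alpha
function recallAtK(rankedIds, relevantIds, k) {
  const top = new Set(rankedIds.slice(0, k));
  const hits = relevantIds.filter(id => top.has(id)).length;
  return relevantIds.length === 0 ? 0 : hits / relevantIds.length;
}

async function tuneAlpha(evalSet, search, { alphas = [0, 0.25, 0.5, 0.75, 1], k = 10 } = {}) {
  let best = { alpha: null, recall: -1 };
  for (const alpha of alphas) {
    let total = 0;
    for (const { query, relevantIds } of evalSet) {
      const results = await search(query, alpha);
      total += recallAtK(results.map(r => r.id), relevantIds, k);
    }
    // Mean recall@k across the evaluation set for this alpha.
    const recall = total / evalSet.length;
    if (recall > best.recall) best = { alpha, recall };
  }
  return best;
}
```

Recall@k is only one choice of metric; swap in whatever metric matches how your results are consumed.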

Chunking

Long documents (PDFs, articles, manuals) don't embed well as one vector — too much disparate content compressed into one point. Chunk first.

Strategy	When
Fixed-size chunks (e.g., 500 tokens)	Simple; works OK
Sliding window (500-token chunks with 50-token overlap)	Better; queries hitting boundaries still find context
Semantic chunking (split at paragraph / section boundaries)	Best; respects content structure
Hierarchical (chunk + parent chunk references)	Retrieve small for precision, expand to parent for context
One chunk per logical unit (FAQ entry, code function)	Match the unit users actually want

For most RAG: 500-1000 tokens per chunk, 10-20% overlap, store the parent doc reference. Retrieve the chunk; pass the parent doc as context.

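// Naive chunker. Splits on whitespace as a stand-in for a real tokenizer,
// so "size" counts words here, not model tokens.
function tokenize(text) {
  return text.split(/\s+/).filter(Boolean);
}

function chunk(text, size = 500, overlap = 50) {
  if (overlap >= size) throw new Error('overlap must be smaller than size');
  const tokens = tokenize(text);
  const chunks = [];
  for (let i = 0; i < tokens.length; i += size - overlap) {
    chunks.push(tokens.slice(i, i + size).join(' '));
    if (i + size >= tokens.length) break;   // this window reached the end; skip the redundant tail
  }
  return chunks;
}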

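Real chunking: use a library splitter. langchain.text_splitter.RecursiveCharacterTextSplitter (LangChain also ships it in its JavaScript packages) or llama_index.SimpleNodeParser (named SentenceSplitter in newer LlamaIndex releases) do this well.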
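If you want to see the idea behind semantic chunking with a parent reference before reaching for a library, here's a minimal sketch. It splits at blank lines, packs paragraphs into chunks up to a size budget, and tags each chunk with its parent document. Sizes are in characters for simplicity, and the record shape (`id`, `parentId`, `text`) is an assumption for illustration, not a library format.

```javascript
// Paragraph-based chunker sketch. maxChars is a character budget, not tokens.
function chunkByParagraph(docId, text, maxChars = 2000) {
  const paragraphs = text.split(/\n\s*\n/).map(p => p.trim()).filter(Boolean);
  const chunks = [];
  let current = '';

  for (const para of paragraphs) {
    // Start a new chunk when adding this paragraph would exceed the budget.
    if (current && current.length + para.length + 2 > maxChars) {
      chunks.push(current);
      current = '';
    }
    current = current ? `${current}\n\n${para}` : para;
  }
  if (current) chunks.push(current);

  // Each record points back to its parent so retrieval can expand to the full doc.
  return chunks.map((chunkText, i) => ({
    id: `${docId}#${i}`,
    parentId: docId,
    text: chunkText,
  }));
}
```

A single paragraph longer than `maxChars` becomes its own oversized chunk here; a production splitter would fall back to sentence or fixed-size splitting for that case.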

Reranking

After hybrid retrieval, a cross-encoder model can re-score by taking the query and each candidate together:

Hybrid retrieval (50 candidates)
  │
  ▼
Cross-encoder: score(query, doc1), score(query, doc2), ...
  │
  ▼
Top 5 by cross-encoder score → pass to LLM

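Cross-encoders are slower than bi-encoders (the standard embedding model) but much more accurate, because they read the query and document together instead of comparing two independently computed vectors. That also means nothing can be precomputed per document, so they only make sense on a short candidate list: scoring 50 candidates per query is affordable, scoring the whole corpus is not.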
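The rerank step itself is small, whichever model sits behind it. Here's a provider-agnostic sketch: `scorePair` is a hypothetical async function wrapping your cross-encoder of choice, assumed to return a relevance score (higher means more relevant) for one query and one document text.

```javascript
// Provider-agnostic rerank sketch. scorePair(query, text) is a hypothetical
// caller-supplied wrapper around a cross-encoder; candidates are { text, ... } objects.
async function rerankWith(scorePair, query, candidates, topN = 5) {
  // Score every (query, candidate) pair, keeping the original fields.
  const scored = await Promise.all(
    candidates.map(async (c) => ({ ...c, rerankScore: await scorePair(query, c.text) }))
  );
  // Highest cross-encoder score first, then keep only the top N.
  return scored.sort((a, b) => b.rerankScore - a.rerankScore).slice(0, topN);
}
```

Keeping `rerankScore` on each result makes it easy to log scores or drop candidates below a threshold before they reach the LLM.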

Popular rerankers:

Cohere Rerank — SaaS; commercial; high quality
Voyage AI rerank — SaaS; growing
bge-reranker-large — open-source; self-host
mxbai-rerank-large-v1 — open-source; faster

import { CohereClient } from 'cohere-ai';
const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });

async function rerank(query, candidates) {
  const result = await cohere.rerank({
    query,
    documents: candidates.map(c => c.text),
    topN: 5,
    model: 'rerank-english-v3.0',
  });
  return result.results.map(r => candidates[r.index]);
}

For latency-tolerant queries, reranking is usually the single biggest quality lift after hybrid retrieval.

Filters and Metadata

A common pattern: vector search, but only within a user's data:

const results = await qdrant.search('docs', {
  vector: queryEmb,
  filter: {
    must: [
      { key: 'org_id', match: { value: userOrgId } },
      { key: 'created_at', range: { gte: lastWeekTimestamp } },
    ],
  },
  limit: 20,
});

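Per-tenant filtering is the multi-tenant story: every query carries the tenant's filter. Create payload indexes on the fields you filter by so the conditions are cheap to evaluate. In Qdrant, must conditions are applied as part of the search itself rather than as a post-filter, so you still get up to limit matching results.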
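A sketch of both halves, building the tenant filter and indexing the filtered fields. `buildTenantFilter` and `createFilterIndexes` are illustrative helper names, `client` is assumed to be a `QdrantClient` from `@qdrant/js-client-rest`, and `created_at` is assumed to be stored as an integer timestamp.

```javascript
// Sketch only. Helper names are illustrative; field names match the query above.
function buildTenantFilter(orgId, sinceTimestamp) {
  // The tenant condition is always present; the recency condition is optional.
  const must = [{ key: 'org_id', match: { value: orgId } }];
  if (sinceTimestamp !== undefined) {
    must.push({ key: 'created_at', range: { gte: sinceTimestamp } });
  }
  return { must };
}

async function createFilterIndexes(client, collection) {
  // One payload index per filtered field. Assumes org_id is a string
  // and created_at is an integer timestamp.
  await client.createPayloadIndex(collection, { field_name: 'org_id', field_schema: 'keyword' });
  await client.createPayloadIndex(collection, { field_name: 'created_at', field_schema: 'integer' });
}
```

Building the filter in one place makes it harder to forget the tenant condition on a new query path.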

The Full Stack: Production RAG

User question
  │
  ├──► Query rewrite / expansion (optional, LLM-driven)
  │
  ├──► Hybrid retrieval (vector + BM25, top 50)
  │      with metadata filters applied inside each query (per-tenant, recency)
  │
  ├──► Rerank (cross-encoder, top 5)
  │
  ├──► Build context with citations
  │
  └──► LLM with context
       │
       └──► Answer with [1][2][3] citations
            │
            └──► Verify citations actually exist (don't trust LLM)

Each step matters, and removing any one usually costs quality; confirm how much against your own evaluation set. The hardest part isn't the vector DB. It's everything around it.
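The last two steps are easy to skip, so here's a minimal sketch of them: number the retrieved chunks when building the context, then check that every `[n]` marker in the answer points at a chunk that was actually supplied. The chunk shape (`{ text }`) and both function names are assumptions for illustration.

```javascript
// Number the retrieved chunks so the model can cite them as [1], [2], ...
function buildContext(chunks) {
  return chunks.map((c, i) => `[${i + 1}] ${c.text}`).join('\n\n');
}

// Check the model's [n] markers against the number of chunks it was given.
function verifyCitations(answer, chunkCount) {
  const cited = [...answer.matchAll(/\[(\d+)\]/g)].map(m => Number(m[1]));
  const invalid = cited.filter(n => n < 1 || n > chunkCount);
  return {
    cited: [...new Set(cited)],
    invalid: [...new Set(invalid)],
    ok: invalid.length === 0,
  };
}
```

This only catches citations to sources that don't exist. It does not check that a cited chunk actually supports the sentence it's attached to; that needs a separate check.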

What's Next

You can now do hybrid retrieval at production quality. Best Practices covers operations: embedding model choice, dimensions, indexes, cost, observability.
