Steven's Knowledge

Hybrid Search

Combining keyword and vector search, filters, chunking, reranking - what real RAG looks like

Vector search alone is rarely the answer. The state of the art for RAG and search is hybrid — combine keyword (BM25) and vector results, optionally rerank. This page covers how.

The Problem with Pure Vector

| Keyword catches | Vectors miss |
|---|---|
| Exact product names ("iPhone 16 Pro Max") | Names embed similarly to other phones |
| Codes ("ERR_404", "SKU-12345") | Numeric strings barely contribute to embeddings |
| Rare technical terms | Out-of-distribution words for the embedding model |
| Boolean / Filter operations | Soft semantic matching only |

| Vectors catch | Keyword misses |
|---|---|
| "Couch" ↔ "sofa" | Different words → no overlap |
| Paraphrasing | "I want to cancel my subscription" vs "how do I unsubscribe" |
| Multi-language | Cross-language similarity (with multilingual model) |
| Conceptual proximity | "coffee maker" ↔ "espresso machine" |

Neither alone is enough. The combination beats either.

The Hybrid Pattern

Query
  │
  ├──► Keyword search (BM25)  ──► top-K results + scores
  │
  ├──► Embed query ──► Vector search ──► top-K results + scores
  │
  ├──► Combine (RRF or weighted) ──► top-K hybrid results
  │
  └──► Optional: rerank with a cross-encoder ──► final top-N

Both retrievers return overlapping (but not identical) candidate sets. Combining is the trick.

Reciprocal Rank Fusion (RRF)

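The simplest combination. For each document, sum 1 / (k + rank) across retrievers, where rank is the document's 1-based position in each list and k is a smoothing constant (60 is the usual default):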

function rrf(rankedLists, k = 60) {
  const scores = new Map();

  for (const list of rankedLists) {
    list.forEach((doc, rank) => {
      scores.set(doc.id, (scores.get(doc.id) ?? 0) + 1 / (k + rank + 1));
    });
  }

  return [...scores.entries()]
    .sort(([, a], [, b]) => b - a)
    .map(([id, score]) => ({ id, score }));
}

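// Usage: each list must be ordered best-first and share one id format.
// `query_by: 'name'` and the collection names are placeholders for your schema; `emb` is the query embedding.
const ts = await typesense.collections('products').documents().search({ q: query, query_by: 'name' });
const keywordResults = ts.hits.map(h => ({ id: String(h.document.id) }));  // [{id: '7'}, {id: '12'}, ...]
const hits = await qdrant.search('products', { vector: emb, limit: 20 });
const vectorResults = hits.map(h => ({ id: String(h.id) }));               // same id format as above
const merged = rrf([keywordResults, vectorResults]);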

RRF doesn't need scores to be on comparable scales because it only uses ranks. That makes it a robust default for merging very different retrievers.
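To see the ranks at work, here is a small worked example. The document ids and both rankings are made up for illustration, and rrf is repeated from above only so the snippet runs on its own.

```javascript
// Same rrf as above, repeated so this snippet is self-contained.
function rrf(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((doc, rank) => {
      // rank is 0-based here, so +1 turns it into the 1-based rank of the formula.
      scores.set(doc.id, (scores.get(doc.id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort(([, a], [, b]) => b - a)
    .map(([id, score]) => ({ id, score }));
}

// Hypothetical rankings, best first.
const keywordList = [{ id: 'A' }, { id: 'B' }, { id: 'C' }];
const vectorList = [{ id: 'B' }, { id: 'D' }, { id: 'A' }];

const fused = rrf([keywordList, vectorList]);
console.log(fused.map(r => r.id)); // [ 'B', 'A', 'D', 'C' ]
```

B wins (1/62 + 1/61) because it ranks high in both lists, ahead of A (1/61 + 1/63), which tops the keyword list but is only third in the vector list.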

Weighted Combination

If you want to tune the keyword/vector weight:

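// Min-max scale one result list's scores into [0, 1] so both retrievers are comparable.
// Min-max is one common choice, assumed here; any per-list normalization can be swapped in.
function normalizeScores(results) {
  const values = results.map(r => r.score);
  const min = Math.min(...values);
  const range = Math.max(...values) - min;
  // If every score is equal (or there is a single result), treat each as a full match.
  return results.map(r => ({ id: r.id, score: range === 0 ? 1 : (r.score - min) / range }));
}

function weightedFusion(keywordResults, vectorResults, alpha = 0.5) {
  const scores = new Map();

  for (const r of normalizeScores(keywordResults)) {
    scores.set(r.id, (scores.get(r.id) ?? 0) + alpha * r.score);
  }
  for (const r of normalizeScores(vectorResults)) {
    scores.set(r.id, (scores.get(r.id) ?? 0) + (1 - alpha) * r.score);
  }

  // Sorted [id, score] pairs, best first.
  return [...scores.entries()].sort(([, a], [, b]) => b - a);
}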

alpha ranges from 0 (pure vector) to 1 (pure keyword). Values around 0.3-0.5 are a common starting point.

Tuning alpha is best done with an evaluation set — examples of queries with known "good" results.
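A minimal sketch of that tuning loop, assuming a tiny hand-labelled evaluation set and recall@k as the metric. The evalSet shape and the names fuse, recallAtK, and tuneAlpha are hypothetical, and the scores are assumed to be already scaled to [0, 1].

```javascript
// Weighted fusion over results whose scores are already scaled to [0, 1].
// Returns document ids, best first.
function fuse(keywordResults, vectorResults, alpha) {
  const scores = new Map();
  for (const r of keywordResults) {
    scores.set(r.id, (scores.get(r.id) ?? 0) + alpha * r.score);
  }
  for (const r of vectorResults) {
    scores.set(r.id, (scores.get(r.id) ?? 0) + (1 - alpha) * r.score);
  }
  return [...scores.entries()].sort(([, a], [, b]) => b - a).map(([id]) => id);
}

// Fraction of the known-good ids that appear in the top k fused results.
function recallAtK(rankedIds, relevantIds, k) {
  const top = new Set(rankedIds.slice(0, k));
  return relevantIds.filter(id => top.has(id)).length / relevantIds.length;
}

// Try each candidate alpha and keep the one with the best mean recall@k.
function tuneAlpha(evalSet, alphas, k) {
  let best = { alpha: alphas[0], recall: -1 };
  for (const alpha of alphas) {
    const total = evalSet.reduce(
      (sum, ex) => sum + recallAtK(fuse(ex.keyword, ex.vector, alpha), ex.relevant, k),
      0
    );
    const recall = total / evalSet.length;
    if (recall > best.recall) best = { alpha, recall };
  }
  return best;
}

// Hypothetical evaluation set: per query, both retrievers' results plus the known-good ids.
const evalSet = [
  {
    keyword: [{ id: 'sku-1', score: 1.0 }, { id: 'x', score: 0.4 }],
    vector: [{ id: 'y', score: 1.0 }, { id: 'sku-1', score: 0.2 }],
    relevant: ['sku-1'],
  },
  {
    keyword: [{ id: 'z', score: 1.0 }, { id: 'sofa', score: 0.6 }],
    vector: [{ id: 'sofa', score: 1.0 }, { id: 'z', score: 0.3 }],
    relevant: ['sofa'],
  },
];

const best = tuneAlpha(evalSet, [0, 0.25, 0.5, 0.75, 1], 1);
console.log(best); // { alpha: 0.5, recall: 1 }
```

In this toy set the first query needs the keyword signal and the second needs the vector signal, so only the balanced setting gets both right. A real evaluation set would be much larger, and the metric (recall@k, MRR, nDCG) should match what the application cares about.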

Many vector databases and search engines ship hybrid search out of the box:

Qdrant (server-side fusion)

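// Query API: each prefetch runs one retriever, then the outer query fuses them.
// 'bm25' and 'dense' are the names of the sparse and dense vectors in this collection.
const results = await qdrant.query('products', {
  prefetch: [
    {
      query: querySparse,     // sparse vector { indices, values } from a BM25-style encoder
      using: 'bm25',
      limit: 20,
    },
    {
      query: queryEmbedding,  // dense vector (number[])
      using: 'dense',
      limit: 20,
    },
  ],
  query: { fusion: 'rrf' },
  limit: 10,
});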

Weaviate

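// Note: Weaviate's alpha runs the other way from weightedFusion above:
// 1 is pure vector, 0 is pure keyword.
const results = await client.graphql.get()
  .withClassName('Product')
  .withFields('name _additional { score }')  // 'name' is a placeholder property
  .withHybrid({ query: 'coffee maker', alpha: 0.5 })
  .withLimit(10)
  .do();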

Elasticsearch / OpenSearch

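Elasticsearch supports hybrid natively through its rrf retriever. OpenSearch offers a hybrid query whose sub-query scores are combined in a search pipeline.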

pgvector

You combine queries yourself:

WITH keyword_results AS (
  SELECT id, ts_rank(document, query) AS score,
         row_number() OVER (ORDER BY ts_rank(document, query) DESC) AS rank
  FROM products, plainto_tsquery('coffee maker') query
  WHERE document @@ query
  ORDER BY score DESC
  LIMIT 20
),
vector_results AS (
  SELECT id, 1 - (embedding <=> '[0.21, ...]'::vector) AS score,
         row_number() OVER (ORDER BY embedding <=> '[0.21, ...]'::vector) AS rank
  FROM products
  ORDER BY embedding <=> '[0.21, ...]'::vector
  LIMIT 20
)
SELECT id, SUM(1.0 / (60 + rank)) AS score
FROM (
  SELECT id, rank FROM keyword_results
  UNION ALL
  SELECT id, rank FROM vector_results
) u
GROUP BY id
ORDER BY score DESC
LIMIT 10;

A bit more code but no extra service.

Chunking

Long documents (PDFs, articles, manuals) don't embed well as one vector — too much disparate content compressed into one point. Chunk first.

| Strategy | When |
|---|---|
| Fixed-size chunks (e.g., 500 tokens) | Simple; works OK |
| Sliding window (500-token chunks with 50-token overlap) | Better; queries hitting boundaries still find context |
| Semantic chunking (split at paragraph / section boundaries) | Best; respects content structure |
| Hierarchical (chunk + parent chunk references) | Retrieve small for precision, expand to parent for context |
| One chunk per logical unit (FAQ entry, code function) | Match the unit users actually want |

For most RAG workloads, a reasonable starting point is 500-1000 tokens per chunk with 10-20% overlap, storing a reference to the parent document on each chunk. Retrieve on the chunk, then pass the parent document as context.
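The "retrieve the chunk, pass the parent" step could look roughly like this. It is a sketch under assumptions: the parentId field, the in-memory parents map, and expandToParents are all hypothetical names, and a real system would look parents up in its document store.

```javascript
// Hypothetical parent store: chunks carry a parentId pointing at the full document.
const parents = new Map([
  ['manual-1', { id: 'manual-1', text: 'Full text of the coffee maker manual' }],
  ['faq-7', { id: 'faq-7', text: 'Full text of the returns FAQ' }],
]);

// Map retrieved chunks to their parent documents, keeping first-hit order
// and dropping duplicates when several chunks share a parent.
function expandToParents(retrievedChunks, parentStore, maxParents = 3) {
  const seen = new Set();
  const result = [];
  for (const chunk of retrievedChunks) {
    if (seen.has(chunk.parentId)) continue;
    const parent = parentStore.get(chunk.parentId);
    if (!parent) continue; // orphaned chunk: skip rather than fail
    seen.add(chunk.parentId);
    result.push(parent);
    if (result.length === maxParents) break;
  }
  return result;
}

// Chunks as they might come back from retrieval, best first.
const chunkHits = [
  { id: 'manual-1#3', parentId: 'manual-1' },
  { id: 'manual-1#4', parentId: 'manual-1' },
  { id: 'faq-7#0', parentId: 'faq-7' },
];

const context = expandToParents(chunkHits, parents);
console.log(context.map(p => p.id)); // [ 'manual-1', 'faq-7' ]
```

The maxParents cap is there because whole documents are much larger than chunks, so the context budget fills up quickly.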

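// Naive chunker. Whitespace-separated words stand in for real tokens here.
function chunk(text, size = 500, overlap = 50) {
  if (overlap >= size) throw new RangeError('overlap must be smaller than size');
  const tokens = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let i = 0; i < tokens.length; i += size - overlap) {
    chunks.push(tokens.slice(i, i + size).join(' '));
    if (i + size >= tokens.length) break; // this window already reached the end
  }
  return chunks;
}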

For real chunking, use a library splitter: LangChain's RecursiveCharacterTextSplitter or LlamaIndex's SentenceSplitter (SimpleNodeParser in older releases) handle separators and overlap well.
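For the semantic strategy in the table above, a rough sketch: split on blank lines and pack whole paragraphs greedily up to a token budget. Word counts stand in for real token counts, and chunkByParagraph is an illustrative name, not what either library does internally.

```javascript
// Count whitespace-separated words as a rough stand-in for tokens.
const countTokens = text => text.split(/\s+/).filter(Boolean).length;

// Pack whole paragraphs into chunks of at most maxTokens.
// A single paragraph larger than maxTokens becomes its own oversized chunk.
function chunkByParagraph(text, maxTokens = 500) {
  const paragraphs = text.split(/\n\s*\n/).map(p => p.trim()).filter(Boolean);
  const chunks = [];
  let current = [];
  let currentTokens = 0;

  for (const para of paragraphs) {
    const n = countTokens(para);
    // Flush the current chunk if adding this paragraph would exceed the budget.
    if (current.length > 0 && currentTokens + n > maxTokens) {
      chunks.push(current.join('\n\n'));
      current = [];
      currentTokens = 0;
    }
    current.push(para);
    currentTokens += n;
  }
  if (current.length > 0) chunks.push(current.join('\n\n'));
  return chunks;
}

const sampleDoc = 'one two three\n\nfour five\n\nsix seven eight nine';
const paragraphChunks = chunkByParagraph(sampleDoc, 5);
console.log(paragraphChunks.length); // 2
```

With a budget of 5, the first two paragraphs (3 + 2 words) share a chunk and the third starts a new one, so no paragraph is ever cut in the middle.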

Reranking

After hybrid retrieval, a cross-encoder model can re-score by taking the query and each candidate together:

Hybrid retrieval (50 candidates)
        │
        ▼
Cross-encoder: score(query, doc1), score(query, doc2), ...
        │
        ▼
Top 5 by cross-encoder score → pass to LLM

Cross-encoders are slower than bi-encoders (the standard embedding models) because they read the query and document together, so nothing can be precomputed per document. That joint view is also why they are usually more accurate. At around 50 candidates per query, the cost is typically affordable.

Popular rerankers:

  • Cohere Rerank — SaaS; commercial; high quality
  • Voyage AI rerank — SaaS; growing
  • bge-reranker-large — open-source; self-host
  • mxbai-rerank-large-v1 — open-source; faster

import { CohereClient } from 'cohere-ai';
const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });

async function rerank(query, candidates) {
  const result = await cohere.rerank({
    query,
    documents: candidates.map(c => c.text),
    topN: 5,
    model: 'rerank-english-v3.0',
  });
  return result.results.map(r => candidates[r.index]);
}

For latency-tolerant queries, reranking is usually the single biggest quality lift after hybrid retrieval.
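The same step can be written against any scorer. A minimal sketch, where scoreFn is a hypothetical stand-in for a cross-encoder call, and the word-overlap scorer exists only so the example runs without a model:

```javascript
// Rerank candidates with any (query, text) -> number scorer and keep the top N.
async function rerankWith(scoreFn, query, candidates, topN = 5) {
  const scored = await Promise.all(
    candidates.map(async c => ({ ...c, rerankScore: await scoreFn(query, c.text) }))
  );
  return scored.sort((a, b) => b.rerankScore - a.rerankScore).slice(0, topN);
}

// Toy scorer: fraction of query words found in the document.
// A real cross-encoder (hosted or self-hosted) would replace this function.
function overlapScore(query, text) {
  const words = query.toLowerCase().split(/\s+/).filter(Boolean);
  const docWords = new Set(text.toLowerCase().split(/\W+/));
  return words.filter(w => docWords.has(w)).length / words.length;
}

const rerankCandidates = [
  { id: 1, text: 'Grinder maintenance tips' },
  { id: 2, text: 'How to descale your coffee maker' },
  { id: 3, text: 'Coffee bean storage' },
];

const rerankDemo = rerankWith(overlapScore, 'descale coffee maker', rerankCandidates, 2)
  .then(top => {
    console.log(top.map(c => c.id)); // [ 2, 3 ]
    return top;
  });
```

Keeping the scorer as a parameter makes it straightforward to swap rerankers and compare them on the same candidate sets.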

Filters and Metadata

A common pattern: vector search, but only within a user's data:

const results = await qdrant.search('docs', {
  vector: queryEmb,
  filter: {
    must: [
      { key: 'org_id', match: { value: userOrgId } },
      { key: 'created_at', range: { gte: lastWeekTimestamp } },
    ],
  },
  limit: 20,
});

Per-tenant filtering is how multi-tenancy works in a shared collection. Create payload indexes on the fields you filter by: Qdrant applies must conditions as part of the ANN search rather than as a post-filter, and the indexes keep that fast.
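A sketch of the two supporting pieces, assuming the Qdrant JS REST client. tenantFilter and ensureFilterIndexes are hypothetical helper names, and the createPayloadIndex arguments should be checked against your client version.

```javascript
// Build the per-tenant filter used above. `sinceTs` is optional.
function tenantFilter(orgId, sinceTs) {
  const must = [{ key: 'org_id', match: { value: orgId } }];
  if (sinceTs !== undefined) {
    must.push({ key: 'created_at', range: { gte: sinceTs } });
  }
  return { must };
}

// Create payload indexes for the filtered fields.
// Assumes `client` is a QdrantClient from @qdrant/js-client-rest and that
// created_at is stored as an integer timestamp.
async function ensureFilterIndexes(client, collection) {
  await client.createPayloadIndex(collection, { field_name: 'org_id', field_schema: 'keyword' });
  await client.createPayloadIndex(collection, { field_name: 'created_at', field_schema: 'integer' });
}

console.log(JSON.stringify(tenantFilter('org_42', 1700000000)));
```

Build the filter on the server from the authenticated session, never from a value the client sends, so one tenant cannot query another tenant's data.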

The Full Stack: Production RAG

User question
  │
  ├──► Query rewrite / expansion (optional, LLM-driven)
  │
  ├──► Hybrid retrieval (vector + BM25, top 50)
  │
  ├──► Metadata filter (per-tenant, recency)
  │
  ├──► Rerank (cross-encoder, top 5)
  │
  ├──► Build context with citations
  │
  └──► LLM with context
       │
       └──► Answer with [1][2][3] citations
            │
            └──► Verify citations actually exist (don't trust LLM)

Each step contributes; removing any one of them usually shows up as a measurable drop in answer quality. The hardest part isn't the vector DB. It's everything around it.
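The last step of the pipeline, checking that cited sources exist, can be as small as this. It is a sketch that assumes sources were numbered 1..N in the prompt and cited as [n]; findInvalidCitations is a hypothetical helper name.

```javascript
// Return the citation numbers in the answer that do not map to a provided source.
// Sources are assumed to be numbered 1..sourceCount in the prompt.
function findInvalidCitations(answer, sourceCount) {
  const cited = [...answer.matchAll(/\[(\d+)\]/g)].map(m => Number(m[1]));
  const invalid = cited.filter(n => n < 1 || n > sourceCount);
  return [...new Set(invalid)]; // de-duplicate repeated bad citations
}

const sampleAnswer = 'Descale monthly [1]. Use filtered water [2][4].';
const invalidCitations = findInvalidCitations(sampleAnswer, 3);
console.log(invalidCitations); // [ 4 ]
```

This only confirms that each cited number points at a real source. Whether the source actually supports the sentence is a separate, harder check.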

What's Next

You now have the pieces for production-quality hybrid retrieval. Best Practices covers operations: embedding model choice, dimensions, indexes, cost, observability.
