Hybrid Search in Feather DB: BM25 + Dense Vectors Combined

Architecture · Feather DB · June 2026

The Problem With Picking One

Every search system makes a tradeoff. The two dominant approaches — keyword scoring and semantic vector search — each fail in a predictable and complementary way. Hybrid search in Feather DB exists because that failure is avoidable.

Here's what breaks, and why.

What BM25 Does

BM25 (Best Match 25) is a term frequency-inverse document frequency scoring function. For every query term, it asks two things:

How often does this term appear in the document? (term frequency, with saturation — the 50th occurrence of "vector" matters less than the 5th)
How rare is this term across the corpus? (inverse document frequency — "the" scores near zero, "HNSW" scores high)

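The final BM25 score is a sum over the query terms, each weighted by its inverse document frequency, with term frequency also normalized by document length so that long documents are not favored just for being long. Documents in which rare query terms occur often rank highest.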
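
To make the two questions concrete, here is a minimal sketch of the textbook Okapi BM25 formula in plain Python. It is not Feather DB's implementation: the whitespace tokenizer, the function name bm25_scores, and the parameter values k1=1.5 and b=0.75 are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with textbook Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer (assumption)
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a high IDF; terms found in most documents approach zero.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates (k1) and is normalized by document length (b).
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "the hnsw graph index stores the vectors",
    "the inverted index stores the terms",
    "access denied for the user",
]
scores = bm25_scores("hnsw index", docs)
```

The first document matches both query terms and scores highest, the second matches only the more common term "index", and the third shares no tokens with the query, so its score is exactly zero.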

BM25 is fast. No embedding model needed. No GPU. No API key. You tokenize, you score, you rank. On 500 queries over a 10K-document corpus, it runs in about 11 seconds on a single CPU thread.

And the recall numbers are genuinely good:

Metric	BM25 (standalone)
recall@1	0.874
recall@3	0.942
recall@5	0.974
recall@10	0.986

These are Feather DB's standalone BM25 results on a 500-query benchmark, no API key required. recall@10 of 0.986 means BM25 finds the right document in the top 10 results 98.6% of the time — if the query uses the exact words the document uses.
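
For readers who want to reproduce this kind of number on their own data, recall@k can be computed as sketched below. The sketch assumes exactly one relevant document per query, which matches the "finds the right document" reading above; the function name and the toy data are made up for illustration and are not the benchmark itself.

```python
def recall_at_k(ranked_ids_per_query, relevant_id_per_query, k):
    """Fraction of queries whose relevant document appears in the top k results."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query)
        if relevant in ranked[:k]
    )
    return hits / len(relevant_id_per_query)

# Toy data: three queries, each with a ranked result list and one relevant document.
ranked = [["d1", "d7", "d3"], ["d4", "d2", "d9"], ["d5", "d6", "d8"]]
relevant = ["d1", "d2", "d0"]

recall_1 = recall_at_k(ranked, relevant, k=1)  # only the first query hits at rank 1
recall_3 = recall_at_k(ranked, relevant, k=3)  # the second query hits at rank 2
```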

That last clause is the problem.

What Dense Vector Search Does

Dense vector search works differently. You run your query through an embedding model — OpenAI's text-embedding-3-small, Gemini's gemini-embedding-exp-03-07, or any other — and get back a high-dimensional float vector that encodes semantic meaning, not token identity.

Documents are pre-embedded the same way. At query time, you find the documents whose vectors are closest to the query vector in that embedding space — nearest neighbor search.

Feather DB implements this with HNSW (Hierarchical Navigable Small World graphs), accelerated with AVX2/AVX512 SIMD on x86. The structure lets you approximate nearest neighbors in roughly logarithmic time rather than scanning every vector. p50 ANN latency on 500K vectors: 0.19ms.
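
For intuition, this is the exact linear scan that an HNSW index avoids: score every document vector against the query and keep the best k. It is a plain-Python sketch with hand-made 3-dimensional vectors standing in for real embeddings. The function names are illustrative and nothing here is Feather DB code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query_vec, doc_vecs, k):
    """Exact (brute-force) top-k by cosine similarity; returns (doc_id, score) pairs."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up vectors: the two vehicle words point in nearly the same direction.
doc_vecs = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}
top = nearest_neighbors([1.0, 0.0, 0.0], doc_vecs, k=2)
```

The scan costs one similarity computation per stored vector, which is why it stops being practical at hundreds of thousands of documents and an approximate index takes over.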

What dense search does that BM25 cannot: it understands paraphrases. "Car" and "automobile" land near each other in embedding space. "My API keeps throwing 429 errors" and "rate limiting in production" surface the same documents. The model learned semantic proximity from training on language, not from token overlap.

What dense search misses: exact tokens. If a user queries SKU-10042 or GPT-4o-mini or feather_db.DB.open(), the embedding model compresses those into a region of the embedding space (768-dimensional in the examples here) shared with vaguely similar strings. The exact character sequence stops mattering. A document containing SKU-10042 verbatim may not rank above a document that "sounds like" product identifiers in general.

Why Neither Alone Is Enough

The failure modes are symmetric:

BM25 misses paraphrases. "Authorization failed" vs "access denied" — same error, different tokens, zero overlap score. BM25 returns nothing useful.
Dense misses exact tokens. CVE-2024-38816, order #TXN-8821, --ef-construction=400 — the embedding model treats these as opaque blobs and often ranks near-meaningless neighbors above the exact match.

Real user queries are a mix of both patterns. A developer searching "HNSW recall drops with ef below 50" needs dense search to understand the concept, but also needs keyword match to surface the exact parameter name. A support agent searching "customer ID C-48821 refund request" needs exact ID match from BM25 and semantic context from dense.

Hybrid search is not a compromise. It's the correct answer.

Feather DB's Hybrid Approach

Feather DB computes both scores at query time and fuses them into a single ranked list.

The fusion method is a weighted linear combination of normalized scores:

hybrid_score = alpha * dense_score + (1 - alpha) * bm25_score

alpha controls the balance. At alpha=1.0 you get pure dense. At alpha=0.0 you get pure BM25. At alpha=0.7, the default, dense similarity carries 70% of the weight and BM25 the remaining 30%, which is enough to lift exact token matches above semantically similar neighbors.

Before combining, scores are normalized to [0, 1] within each result set. BM25 scores are unbounded floats; cosine similarity scores are [-1, 1]. Min-max normalization within the candidate set makes them comparable before the weighted sum.
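
A minimal sketch of min-max normalization follows, to show how two incompatible score scales end up on the same [0, 1] range. The raw values are made up, and the handling of an all-tied list (mapping every score to 1.0) is an assumption, since the tie convention is not specified here.

```python
def min_max_normalize(scores):
    """Rescale a non-empty list of scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All candidates tied: assumed convention is to map them all to 1.0.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

bm25_raw = [12.4, 7.1, 3.3]      # unbounded BM25 scores (made-up values)
cosine_raw = [0.91, 0.88, 0.42]  # cosine similarities (made-up values)

bm25_norm = min_max_normalize(bm25_raw)
cosine_norm = min_max_normalize(cosine_raw)
```

One consequence worth keeping in mind: the weakest candidate in each list always maps to 0.0 and the strongest to 1.0, so normalized scores describe rank within the candidate set, not absolute relevance.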

The candidate set is the union of top-K results from both retrieval passes. A document that BM25 misses but dense finds (or vice versa) is still eligible for the final ranking. Neither retriever can veto a result — only the combined score determines the final order.

The API

Three search modes, one method:

import feather_db
import numpy as np

db = feather_db.DB.open("knowledge.feather", dim=768)

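# Your query, embedded by whatever model you're using.
# embed() is a placeholder for that model's call; it must return a
# 768-dimensional vector to match dim above.
query_vec = embed("authorization failed connecting to database")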

# Mode 1: pure dense (semantic similarity only)
results_dense = db.search(query_vec, k=10, mode="dense")

# Mode 2: pure keyword (BM25 only — no embedding needed at search time)
results_bm25 = db.search(query_vec, k=10, mode="keyword")

# Mode 3: hybrid (default — weighted combination)
results_hybrid = db.search(query_vec, k=10, mode="hybrid")

# Adjust the balance: alpha=0.7 means 70% dense, 30% BM25
results_tuned = db.search(query_vec, k=10, mode="hybrid", alpha=0.7)

The mode parameter is the only addition, and because hybrid is the default it can be omitted. All other search arguments (k, filter, metadata filtering) work identically across modes.

Side-by-Side Comparison

The same query, three modes. Query: "SKU-10042 out of stock notification".

import feather_db

db = feather_db.DB.open("products.feather", dim=768)
query_vec = embed("SKU-10042 out of stock notification")

print("=== DENSE ONLY ===")
for r in db.search(query_vec, k=3, mode="dense"):
    print(f"  [{r.score:.3f}] {r.meta.get_attribute('title')}")

# Output (dense only):
#   [0.912] Inventory notification system overview
#   [0.887] Managing product availability alerts
#   [0.871] Out of stock handling best practices
# — SKU-10042 document does not appear in top 3

print("\n=== BM25 ONLY ===")
for r in db.search(query_vec, k=3, mode="keyword"):
    print(f"  [{r.score:.3f}] {r.meta.get_attribute('title')}")

# Output (BM25 only):
#   [0.998] SKU-10042: Product page and inventory record
#   [0.743] Notification triggers for SKU-level events
#   [0.681] Stock threshold configuration for SKU-10042

print("\n=== HYBRID (alpha=0.7) ===")
for r in db.search(query_vec, k=3, mode="hybrid", alpha=0.7):
    print(f"  [{r.score:.3f}] {r.meta.get_attribute('title')}")

# Output (hybrid):
#   [0.961] SKU-10042: Product page and inventory record
#   [0.934] Inventory notification system overview
#   [0.891] Notification triggers for SKU-level events
# — exact match surfaces first, semantic context fills positions 2-3

Hybrid gets both: the exact SKU document ranks first (BM25 contribution), and the semantic context documents rank immediately after (dense contribution). Neither mode alone produces this result.

Score Fusion: Weighting BM25 vs Dense

The default alpha=0.7 is a reasonable starting point, not a universal truth. How to tune it:

alpha closer to 1.0 — query is open-ended, conceptual, paraphrase-heavy. "What causes high latency in vector search?" Dense dominates; BM25 adds light re-ranking for technical terms.
alpha closer to 0.5 — query mixes concepts with specific identifiers. "HNSW ef parameter tuning for recall@10". Equal weight; both signals matter.
alpha closer to 0.0 — query is a lookup by exact token. "Transaction TXN-8821 status". BM25 dominates; dense is noise.

In practice, most user-facing search interfaces benefit from alpha=0.65–0.75. Log query patterns for a week, find the queries that return wrong top-1 results, and nudge alpha in the direction that fixes the majority.
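
One way to turn that advice into a repeatable check is an offline sweep: replay logged queries at several alpha values and count how often the expected document lands at position 1. The sketch below assumes you have already stored, per candidate, the normalized dense and BM25 scores. The log format, the function names, and the data are all hypothetical.

```python
def fused_score(scores, alpha):
    """Weighted sum of a (dense_norm, bm25_norm) pair."""
    dense_norm, bm25_norm = scores
    return alpha * dense_norm + (1 - alpha) * bm25_norm

def top1_accuracy(logged_queries, alpha):
    """Share of logged queries whose expected document ranks first under this alpha."""
    correct = 0
    for query in logged_queries:
        candidates = query["candidates"]
        best_id = max(candidates, key=lambda doc_id: fused_score(candidates[doc_id], alpha))
        if best_id == query["expected"]:
            correct += 1
    return correct / len(logged_queries)

# Made-up log: per candidate, (normalized dense score, normalized BM25 score).
logged_queries = [
    {"expected": "sku_page",
     "candidates": {"sku_page": (0.6, 1.0), "overview": (1.0, 0.3)}},
    {"expected": "latency_guide",
     "candidates": {"latency_guide": (1.0, 0.4), "changelog": (0.5, 1.0)}},
]

for alpha in (0.3, 0.5, 0.6, 0.7, 0.9):
    print(f"alpha={alpha}: top-1 accuracy {top1_accuracy(logged_queries, alpha):.2f}")
```

In this toy log the identifier lookup is only answered correctly at lower alpha values and the conceptual query only at higher ones, so a value in between is the only setting that gets both right. Real logs will be noisier, but the sweep works the same way.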

Production Tip: Match Mode to Task

Not every search is a user query. Different tasks have different optimal modes:

Task	Recommended mode	Reason
User search bar query	`hybrid`	Mix of intent types; covers both exact and semantic
"Find similar documents"	`dense`	Pure semantic — no exact token expected
ID / SKU / code lookup	`keyword`	Exact token match; dense adds noise
Agent memory retrieval	`hybrid`	Agents mix conceptual reasoning with specific references
Deduplication check	`dense`	Near-duplicate detection is a semantic problem
Citation / reference lookup	`keyword`	Exact title / DOI / reference string match

The rule of thumb: use hybrid when a human typed the query; use dense when the query is a vector derived from another document; use keyword when the query contains a code or identifier the document should contain verbatim.
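
The rule of thumb can be encoded as a small routing helper in application code. This is a hypothetical heuristic, not part of Feather DB: the function name, its arguments, and the "every token contains a digit" test for identifier-only queries are assumptions you would adapt to your own identifier formats. Only the three mode strings come from the API shown earlier.

```python
def choose_mode(query_text=None, from_document_vector=False):
    """Pick a search mode string: 'dense', 'keyword', or 'hybrid'."""
    if from_document_vector:
        # The query is a vector taken from another document: purely semantic.
        return "dense"
    tokens = (query_text or "").split()
    # Crude identifier test (assumption): a token containing a digit, e.g. SKU-10042.
    id_like = [tok for tok in tokens if any(ch.isdigit() for ch in tok)]
    if tokens and len(id_like) == len(tokens):
        # The query consists only of identifiers: treat it as an exact lookup.
        return "keyword"
    # Anything a human typed that mixes words and identifiers.
    return "hybrid"

mode_lookup = choose_mode("SKU-10042")
mode_mixed = choose_mode("customer ID C-48821 refund request")
mode_similar = choose_mode(from_document_vector=True)
```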

When Hybrid Outperforms

Hybrid has the largest margin over single-mode search in three cases:

Product names and codes. "iPhone 15 Pro camera settings" — dense finds camera documentation; BM25 pins the exact product. Hybrid surfaces the right product's camera documentation first.
Technical identifiers mixed with natural language. "Why does ef_construction=200 improve recall but hurt index time?" — without BM25, ef_construction floats in embedding space near unrelated parameters. BM25 anchors the exact string.
Short, ambiguous queries with a key token. "GPT-4o pricing" — two words. Dense interprets pricing broadly. BM25 locks on "GPT-4o." Hybrid gets both right.

What This Looks Like Internally

Feather DB's search pipeline for mode="hybrid":

HNSW ANN search returns top-K*2 candidates by cosine similarity. (Over-fetch to increase recall before re-ranking.)
BM25 index scores the same query against the full inverted index. Returns top-K*2 candidates by BM25 score.
Union of both candidate sets is formed. Documents appearing in both sets carry scores from both passes. Documents appearing in only one carry a score of 0.0 for the other pass.
Scores are min-max normalized within each list. The dense list is normalized independently from the BM25 list.
Weighted sum: alpha * dense_norm + (1-alpha) * bm25_norm.
Final list is sorted descending. Top k returned.
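
The fusion steps above can be sketched in a few lines of plain Python. The two retrieval passes are replaced by made-up dictionaries of raw scores, and the function names are illustrative. One interpretation is assumed and worth flagging: each list is normalized on its own first, and a document missing from a list then contributes 0.0 for that pass.

```python
def min_max(scores):
    """Min-max normalize a {doc_id: score} mapping to [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}  # tie convention is an assumption
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def hybrid_rank(dense_hits, bm25_hits, k, alpha=0.7):
    """Fuse two {doc_id: raw_score} candidate lists and return the top k pairs."""
    dense_norm = min_max(dense_hits)  # each list is normalized independently
    bm25_norm = min_max(bm25_hits)
    candidates = set(dense_norm) | set(bm25_norm)  # union: neither pass can veto
    fused = {
        doc_id: alpha * dense_norm.get(doc_id, 0.0)
        + (1 - alpha) * bm25_norm.get(doc_id, 0.0)
        for doc_id in candidates
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:k]

# Made-up raw scores standing in for the two over-fetched retrieval passes.
dense_hits = {"overview": 0.91, "sku_page": 0.89, "alerts": 0.88, "faq": 0.60}
bm25_hits = {"sku_page": 14.2, "triggers": 9.8, "thresholds": 6.1, "overview": 5.0}

top = hybrid_rank(dense_hits, bm25_hits, k=3)
```

With these numbers the document that is second in the dense list and first in the BM25 list wins overall, ahead of the dense-only favorite, which is the behavior the side-by-side comparison describes.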

The BM25 index is built at ingestion time from the content attribute of each document's metadata. No separate indexing call needed — it's maintained in the .feather file alongside the HNSW graph.
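
As a rough picture of what "built at ingestion time" involves, here is a sketch of an in-memory inverted index populated from each document's content field. The on-disk layout of the .feather file is not described here, so the data structure, the whitespace tokenizer, and the function name are all assumptions.

```python
from collections import Counter, defaultdict

def build_inverted_index(documents):
    """Map each term to {doc_id: term frequency}, reading each document's 'content' field."""
    index = defaultdict(dict)
    for doc_id, meta in documents.items():
        term_counts = Counter(meta["content"].lower().split())
        for term, freq in term_counts.items():
            index[term][doc_id] = freq
    return index

# Made-up documents; only the 'content' attribute feeds the keyword index.
documents = {
    1: {"title": "SKU page", "content": "SKU-10042 inventory record for SKU-10042"},
    2: {"title": "Alerts", "content": "inventory alerts overview"},
}
index = build_inverted_index(documents)
```

At query time a BM25 scorer only has to visit the postings for the query's terms, which is why keyword search needs no embedding model and no vector comparison.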

Bottom Line

BM25 recall@10 of 0.986 is remarkable for a zero-dependency, 11-second run on 500 queries. Dense search, at a p50 ANN latency of 0.19ms on 500K vectors, leaves ample headroom for real-time use. Hybrid combines both in a single db.search() call.

The decision is not "which is better." It's "which failure mode can I not afford." In most AI agent and user-facing search contexts, the answer is both — which makes mode="hybrid" the right default.

# Start here. Tune alpha if needed.
results = db.search(query_vec, k=10, mode="hybrid", alpha=0.7)