Hybrid Search in Feather DB: BM25 + Dense Vectors via Reciprocal Rank Fusion

The gap between semantic and keyword retrieval

Dense vector search is excellent at semantic similarity: "what are the user's feelings about work-life balance?" retrieves memories about stress, exhaustion, overwork, and burnout even if none of those exact words appear in the query. But embeddings have a fundamental weakness: they compress the meaning of an entire sequence into a single vector. That makes them poor at distinguishing strings where the exact characters matter more than the semantic concept: model names like gpt-4o-mini vs gpt-4o, version numbers like v0.14.2 vs v0.15.1, user IDs, or product SKUs.

BM25 is the opposite: it excels at exact term matching, rare-word recall, and keyword precision, but it has no notion of semantic similarity. A BM25 search for "feeling overwhelmed" will not find a document that says "experiencing burnout", because the two phrases share no terms.

Hybrid search combines both: dense ANN for semantic retrieval, BM25 for keyword precision, fused together so each signal compensates for the other's blind spots.

Reciprocal Rank Fusion

Feather DB uses Reciprocal Rank Fusion (RRF) to merge the dense and BM25 result lists. The RRF formula for a document d given a query q with result lists L_1...L_n is:

# RRF score for a document d given multiple ranked lists.
# k is a smoothing constant: k=60 is the usual default in the literature and the value Feather DB uses.
# Ranks are 1-indexed; a list that does not contain d contributes nothing.
def rrf_score(d, ranked_lists, k=60):
    return sum(1 / (k + ranking.index(d) + 1) for ranking in ranked_lists if d in ranking)

With two lists (dense and BM25), a document ranked 1st in the dense list and 5th in the BM25 list gets:

rrf = 1/(60+1) + 1/(60+5) ≈ 0.016393 + 0.015385 = 0.031778

A document ranked 3rd in dense and 1st in BM25:

rrf = 1/(60+3) + 1/(60+1) ≈ 0.015873 + 0.016393 = 0.032266

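The document ranked 1st by BM25 wins, narrowly. Each document has one first-place rank, so the other list decides: 3rd place in dense beats 5th place in BM25. That is the behavior you want when BM25 has found an exact term match that dense retrieval also ranks highly. The key property of RRF is that it uses only ranks, not raw scores, so cosine similarities and BM25 scores never need to be on the same scale. That makes it more robust than weighted-sum fusion, which needs the two score distributions normalized before they can be added.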
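To check the arithmetic end to end, here is a minimal pure-Python sketch of RRF over ranked lists of document IDs. It is an illustration of the formula, not Feather DB's C++ code; the function name `rrf_fuse` and the filler IDs "x", "y", "z" are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Fuse best-first ranked lists of document IDs with Reciprocal Rank Fusion.

    Ranks are 1-indexed. A document missing from a list simply receives
    no contribution from that list.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for deterministic output
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# The worked example: doc "a" is 1st in dense and 5th in BM25,
# doc "b" is 3rd in dense and 1st in BM25; "x", "y", "z" are filler.
dense = ["a", "x", "b", "y", "z"]
bm25 = ["b", "x", "y", "z", "a"]
for doc_id, score in rrf_fuse([dense, bm25]):
    print(f"{doc_id}: {score:.5f}")
```

Run as-is, it ranks b (0.03227) first, then x (0.03226), then a (0.03178). The filler document x, which is 2nd in both lists, lands between the two: RRF rewards agreement across lists rather than a single outstanding rank.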

Using hybrid search in Feather DB

import feather_db as fdb

# embed() stands in for your embedding model; it must return 768-dim vectors
# to match the dim=768 the database is opened with.
db = fdb.DB.open("agent.feather", dim=768)

# Add some memories: the text is indexed for both dense and BM25 retrieval
memories = [
    "User is working on a project using Claude claude-sonnet-4-5.",
    "User integrated the Anthropic API in their Python backend.",
    "User's app uses GPT-4o for summarization and claude-sonnet-4-5 for reasoning.",
]
for memory in memories:
    db.add(embed(memory), text=memory)

query = "claude-sonnet-4-5"
query_vec = embed(query)  # embed once, reuse for both searches

# Pure dense search: may conflate model names
dense_results = db.search(query_vec, k=5)

# Hybrid search: BM25 ensures exact model-name matches surface
hybrid_results = db.hybrid_search(
    query_text=query,
    query_vec=query_vec,
    k=5,
    namespace="user-alice",  # optional: scoped hybrid search
)

When hybrid beats pure dense

The cases where hybrid search meaningfully outperforms pure dense retrieval in agent memory workloads:

Query type	Example	Why dense fails	Hybrid advantage
Model names	"claude-sonnet-4-5 vs gpt-4o"	Both embed similarly as "AI models"	BM25 exact match distinguishes them
Version numbers	"feather-db v0.15.1 changelog"	Embedding averages over version string	BM25 matches the version token precisely
Proper nouns / names	"Alice mentioned in the standup"	Dense may match "she" or "the user" too broadly	BM25 anchors on the name token
User / entity IDs	"user-42 subscription plan"	Embeddings rarely separate arbitrary IDs reliably	BM25 exact string match on ID token
Technical acronyms	"HNSW ef_construction parameter"	Embeddings may not distinguish HNSW-specific terms	BM25 exact match on rare technical tokens
Semantic concepts	"feeling overwhelmed at work"	N/A: dense excels here	Little or no improvement from BM25

BM25 internals in Feather DB

Feather DB's BM25 index is implemented in C++17 and stored inside the .feather file alongside the HNSW graph and vector data. It uses common textbook parameters: k1=1.5 (term-frequency saturation) and b=0.75 (document-length normalization). The index is updated incrementally on each add() call, so there is no separate indexing step.
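For intuition about what those parameters do and what "incremental" means, here is a minimal pure-Python sketch of an Okapi BM25 index with the same k1=1.5 and b=0.75. It is not Feather DB's implementation: the class name `TinyBM25`, the whitespace tokenizer, and the Lucene-style IDF, log(1 + (N - n + 0.5) / (n + 0.5)), are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyBM25:
    """Hypothetical incremental Okapi BM25 index, for illustration only."""

    def __init__(self, k1=1.5, b=0.75):
        self.k1 = k1                      # term-frequency saturation
        self.b = b                        # document-length normalization
        self.doc_tfs = []                 # per-document term counts
        self.doc_lens = []                # per-document token counts
        self.doc_freq = defaultdict(int)  # term -> number of docs containing it

    @staticmethod
    def tokenize(text):
        return text.lower().split()       # toy tokenizer (assumption)

    def add(self, text):
        """Index one document immediately; there is no separate build step."""
        tf = Counter(self.tokenize(text))
        self.doc_tfs.append(tf)
        self.doc_lens.append(sum(tf.values()))
        for term in tf:
            self.doc_freq[term] += 1
        return len(self.doc_tfs) - 1      # the new document's ID

    def idf(self, term):
        n = self.doc_freq.get(term, 0)
        N = len(self.doc_tfs)
        return math.log(1 + (N - n + 0.5) / (n + 0.5))

    def score(self, query, doc_id):
        tf = self.doc_tfs[doc_id]
        avgdl = (sum(self.doc_lens) / len(self.doc_lens)) or 1.0
        length_norm = 1 - self.b + self.b * self.doc_lens[doc_id] / avgdl
        total = 0.0
        for term in self.tokenize(query):
            f = tf.get(term, 0)
            if f:
                total += self.idf(term) * f * (self.k1 + 1) / (f + self.k1 * length_norm)
        return total

    def search(self, query, k=5):
        hits = [(d, self.score(query, d)) for d in range(len(self.doc_tfs))]
        hits = [h for h in hits if h[1] > 0]
        return sorted(hits, key=lambda h: -h[1])[:k]


index = TinyBM25()
index.add("user-42 is on the pro plan")
index.add("user-7 cancelled their subscription")
index.add("the user asked about plans")
print(index.search("user-42 plan"))  # only doc 0: "user-7" and "plans" are different tokens
```

Because document frequencies and the average length are read at query time, each add() is visible to the very next search without a rebuild, which is the property the paragraph above describes.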

For add_batch() calls, BM25 index updates are batched together with the HNSW insertions, so batched inserts keep their 3.4× throughput improvement. Text is tokenized for BM25 by a fast C++ tokenizer that handles Unicode, splits on punctuation, and lowercases. Stopwords are not removed by default: in agent memory workloads, even common words can carry meaning in context.
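The tokenizer's behavior can be approximated in a few lines of Python. This is a hypothetical sketch, not Feather DB's C++ tokenizer: NFKC normalization, `str.casefold()`, and the rule that hyphens and dots between word characters stay inside a token are all assumptions chosen so that identifiers survive intact.

```python
import re
import unicodedata

# A token is a run of Unicode word characters, optionally joined by single
# hyphens or dots that sit between word characters (gpt-4o, v0.15.1).
# In Python 3, \w in a str pattern already matches Unicode letters and digits.
_TOKEN_RE = re.compile(r"\w+(?:[-.]\w+)*")


def tokenize(text: str) -> list[str]:
    """Hypothetical BM25 tokenizer: NFKC-normalize, case-fold, split on
    whitespace and punctuation, and keep stopwords."""
    normalized = unicodedata.normalize("NFKC", text).casefold()
    return _TOKEN_RE.findall(normalized)


print(tokenize("User's app uses GPT-4o and claude-sonnet-4-5, v0.15.1!"))
print(tokenize("Café in Zürich: the end."))
```

Keeping internal hyphens and dots is a deliberate choice in this sketch: claude-sonnet-4-5 and v0.15.1 match as whole tokens, at the cost of a query for just "sonnet" no longer matching them. How Feather DB's tokenizer treats those characters is not stated here.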

Tuning hybrid search

# Assumes db is open and embed() is your embedding model, as in the example above
query_text = "claude-sonnet-4-5 integration issues"
query_vec = embed(query_text)

# Default hybrid: dense and BM25 rankings contribute equally via RRF
results = db.hybrid_search(query_text, query_vec, k=10)

# Raising ef widens the HNSW search beam, improving recall of the dense list
# before fusion. This improves the dense candidates; it does not add weight
# to the dense list in the RRF formula.
results = db.hybrid_search(query_text, query_vec, k=10, ef=100)

# Scoped hybrid: namespace + entity + hybrid
results = db.hybrid_search(
    query_text="claude-sonnet-4-5 integration issues",
    query_vec=embed("claude-sonnet-4-5 integration issues"),
    k=10,
    namespace="user-alice",
    entity="work-context"
)

# Check scores — hybrid results include the RRF-fused score
for r in results:
    print(f"{r.text[:60]:<60}  score={r.score:.4f}")
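The calls above do not show a weight parameter on hybrid_search, so if you need genuinely dense-heavy fusion, one option is weighted RRF computed client-side over two ranked lists of document IDs you already have. Weighted RRF is a common variant that scales each list's 1/(k + rank) term; it is not a documented Feather DB feature, and the function name `weighted_rrf` is hypothetical.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, k=60):
    """Weighted Reciprocal Rank Fusion: each list's 1/(k + rank) term is
    multiplied by that list's weight. Equal weights give plain RRF ordering."""
    if len(ranked_lists) != len(weights):
        raise ValueError("need exactly one weight per ranked list")
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Best fused score first; ties broken by doc ID
    return sorted(scores, key=lambda d: (-scores[d], d))


# Usage with the ranked IDs from the worked RRF example
dense = ["a", "x", "b", "y", "z"]
bm25 = ["b", "x", "y", "z", "a"]
print(weighted_rrf([dense, bm25], weights=[1.0, 1.0]))  # plain RRF: b ahead of a
print(weighted_rrf([dense, bm25], weights=[2.0, 1.0]))  # dense-heavy: a ahead of b
```

Doubling the dense weight is enough to flip the two documents from the worked example, because their fused scores were only about 0.0005 apart.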

When to use pure dense vs hybrid

Hybrid search has slightly higher latency than pure dense search because it runs both the HNSW traversal and the BM25 search, then merges the two result lists. Hybrid latency is typically 1.2–1.5× the dense-only latency at comparable k. For most agent memory workloads, that is still sub-millisecond.
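The ratio depends on corpus size, k, and hardware, so it is worth measuring on your own data. Below is a minimal standard-library timing helper; the name `median_latency_ms` is hypothetical, and the commented usage calls db.search and db.hybrid_search exactly as in the examples above.

```python
import statistics
import time


def median_latency_ms(fn, *args, repeats=200, warmup=20, **kwargs):
    """Median wall-clock latency of fn(*args, **kwargs), in milliseconds."""
    for _ in range(warmup):               # warm caches before measuring
        fn(*args, **kwargs)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)     # median resists outlier spikes


# Usage (assumes db, query_text, and query_vec from the examples above):
# dense_ms = median_latency_ms(db.search, query_vec, k=10)
# hybrid_ms = median_latency_ms(db.hybrid_search, query_text, query_vec, k=10)
# print(f"hybrid/dense latency ratio: {hybrid_ms / dense_ms:.2f}x")
```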

Use pure dense search when your queries are purely conceptual or emotional, such as "how is the user feeling about their project?", where BM25 adds little signal. Use hybrid when queries contain specific identifiers, technical terms, or proper nouns where exact token matching matters. When in doubt, use hybrid: the overhead is modest, and documents that both retrievers rank well rise to the top. One caveat: RRF sees only ranks, so BM25 reorders results even when its matches are weak, such as overlap on common words (stopwords are kept by default). For purely conceptual queries, dense-only search avoids that noise.

The practical rule: any query that contains tokens unlikely to appear verbatim in nearby semantic neighbors benefits from hybrid. Any query that is purely conceptual in nature can use dense-only.
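The practical rule can be turned into a crude query router. This is a hypothetical heuristic, not part of Feather DB: it flags tokens that contain digits, internal hyphens, underscores, or dots, all-caps acronyms, and capitalized words after the first word. It will misfire on some queries (a name at the very start of a query is missed, for example), so treat it as a starting point.

```python
def prefers_hybrid(query: str) -> bool:
    """Hypothetical router: True if the query has identifier-like or name-like tokens."""
    for i, raw in enumerate(query.split()):
        token = raw.strip(".,!?;:\"'()")
        if not token:
            continue
        if any(ch.isdigit() for ch in token):              # v0.15.1, gpt-4o, user-42
            return True
        if any(sep in token[1:-1] for sep in "-_."):       # feather-db, ef_construction
            return True
        if len(token) >= 2 and token.isupper():            # acronyms: HNSW, API
            return True
        if i > 0 and len(token) > 1 and token[0].isupper() and token[1:].isalpha():
            return True                                    # mid-query name: Alice
    return False


for q in ["HNSW ef_construction parameter", "feeling overwhelmed at work"]:
    print(q, "->", "hybrid" if prefers_hybrid(q) else "dense")
```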

Install: pip install feather-db · GitHub: github.com/feather-store/feather