LongMemEval Benchmark Explained: How We Measure AI Memory Quality

Performance · Feather DB · June 2026

Why benchmarks for memory are hard to trust

Most AI memory benchmarks test retrieval over a handful of documents in a single conversation. That is not how real AI agents work. Real agents talk to thousands of users over months. Facts mentioned weeks ago need to surface when they are relevant today. That is a fundamentally different problem.

LongMemEval was designed to close that gap. It is among the most demanding public benchmarks for AI memory systems, and it is the one we use internally to evaluate Feather DB.

What LongMemEval actually tests

The benchmark simulates a realistic personal assistant scenario. A user has ongoing conversations with an AI assistant over a simulated period of approximately three months. Facts, preferences, and events are scattered across those sessions — some mentioned once, briefly, weeks before a related question is asked.

The benchmark then asks 500 test questions that require the system to recall those facts accurately.

Example question type: "The user mentioned their daughter's name in session 14. What was it?" — and the relevant fact was buried in a single sentence among thousands of conversation turns.

The five question categories stress different failure modes:

Single-session information recall — can the system find a fact from one specific session?
Cross-session reasoning — can it combine facts across two or more separate sessions?
Knowledge updates — if a user corrected a previous fact, does the system use the new value?
Temporal reasoning — can it answer questions about when events happened, or their sequence?
Absent information — can it correctly say "I don't know" when a fact was never mentioned?

The temporal reasoning and knowledge-update categories are where most systems break down. Feather DB's current score on temporal questions is 0.417–0.477. We publish that range as measured and track it explicitly.
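
Tracking a category explicitly amounts to averaging per-question scores by category. The sketch below shows one way to do that; the `(category, score)` record shape and the category labels are illustrative assumptions, not the harness's actual output format.

```python
from collections import defaultdict

def accuracy_by_category(records):
    """Average the per-question scores within each question category."""
    totals = defaultdict(lambda: [0.0, 0])  # category -> [score sum, question count]
    for category, score in records:
        totals[category][0] += score
        totals[category][1] += 1
    return {cat: total / count for cat, (total, count) in totals.items()}

# Hypothetical per-question results: (category, 0-1 score from the judge).
records = [
    ("temporal", 1.0), ("temporal", 0.0),
    ("knowledge-update", 1.0), ("absent", 1.0),
]
by_cat = accuracy_by_category(records)
```

A single overall accuracy can hide a weak category, which is why the per-category view is worth keeping next to the headline number.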

Two harnesses, two things measured

LongMemEval ships with two distinct evaluation harnesses. This distinction matters — conflating them is the most common mistake when reading memory benchmark results.

Harness 1: Retrieval (recall@k)

This harness isolates the retrieval layer. Given a question, does the system return the relevant memory chunk in the top-k results?

The metric is recall@k — the fraction of questions where the ground-truth chunk appears somewhere in the top-k retrieved results. It does not care whether the system can answer the question correctly; it only asks whether the evidence was found.

Recall@k at different depths tells you about the shape of your retrieval system:

High recall@1 means you are ranking the right chunk first — good for low-latency pipelines that cannot afford to pass many chunks to the LLM.
High recall@10 with low recall@1 means you are finding the answer but burying it — your ranking is weak.
Low recall@10 means the chunk is simply not being retrieved at all — a fundamental index or chunking problem.
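
As a minimal sketch of the metric described above, assuming one ground-truth chunk id per question (the function name and toy ids are illustrative, not part of the harness):

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of questions whose ground-truth chunk is in the top-k results."""
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)

# Toy data: three questions, each with a ranked list of retrieved chunk ids.
ranked = [["c1", "c7", "c3"], ["c4", "c2", "c9"], ["c5", "c6", "c8"]]
gold = ["c1", "c9", "c0"]  # the third gold chunk is never retrieved

recall_1 = recall_at_k(ranked, gold, 1)  # only the first question hits at rank 1
recall_3 = recall_at_k(ranked, gold, 3)  # the second question hits at rank 3
```

Here the second question is a ranking miss (found, but not first) and the third is a retrieval miss (never found), which are the two failure shapes the list above distinguishes.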

Harness 2: QA Accuracy (end-to-end)

This harness measures the full pipeline. The retrieved chunks are passed to an LLM answerer, and the generated answer is evaluated against the ground-truth answer using an LLM judge.

The metric is a 0–1 accuracy score over the 500 questions. This is the number most comparable across systems, since it captures both retrieval quality and the answerer's ability to reason over the retrieved context.

Feather DB BM25 baseline: retrieval results

We recently added a BM25 retrieval harness to the benchmark suite inside the feather-db package. BM25 is a classical keyword-based ranking algorithm — no embeddings, no API key, no neural anything.
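
For readers who have not seen BM25 before, here is a minimal sketch of the standard scoring formula. It is not the feather-db implementation: the whitespace tokenizer, the `k1` and `b` values (conventional defaults), and the smoothed IDF variant are all assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the classic BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

chunks = [
    "my daughter maya starts school in september",
    "we talked about the weather in lisbon",
    "the user prefers tea over coffee",
]
scores = bm25_scores("daughter name", chunks)
best = max(range(len(chunks)), key=scores.__getitem__)
```

Only the first chunk shares a term with the query, so it is the only one with a non-zero score. This is the appeal and the limit of keyword retrieval: no model is needed, but a paraphrase with no shared terms scores zero.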

The results are a useful lower bound and a sanity check before you add semantic search on top.

Metric	Score
recall@1	0.874
recall@3	0.942
recall@5	0.974
recall@10	0.986
Total runtime	11 seconds
API key required	No

recall@1 of 0.874 means BM25 alone surfaces the right chunk as the top result for 87.4% of questions. recall@10 of 0.986 means the chunk is somewhere in the first ten results for 98.6% of questions.

This is useful for two reasons. First, it gives you a fast, free baseline to verify your chunking and indexing pipeline before spending money on embeddings. Second, it shows that for many production use cases, keyword retrieval is already strong — semantic search becomes the marginal improvement, not the foundation.

Run it yourself:

pip install feather-db
python -m feather_db.bench.longmemeval --harness retrieval --retriever bm25

Feather DB end-to-end: QA accuracy

With the full pipeline — BM25+dense hybrid retrieval via RRF, adaptive decay scoring, and GPT-4o as the answerer — Feather DB scores 0.693 on LongMemEval QA accuracy.

That means on 500 questions about facts from three months of conversation history, Feather DB answers roughly 346 of them correctly (0.693 × 500 = 346.5).
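
The hybrid step in that pipeline merges a BM25 ranking and a dense ranking with reciprocal rank fusion (RRF). The sketch below shows the generic technique, not Feather DB's internal code; the function name, the toy chunk ids, and the constant `k=60` (a commonly used default) are assumptions.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["c3", "c1", "c8"]   # hypothetical keyword results
dense_ranking = ["c1", "c5", "c3"]  # hypothetical embedding results
fused = rrf_fuse([bm25_ranking, dense_ranking])
```

RRF uses only ranks, so it needs no calibration between BM25 scores and cosine similarities. Chunks that both retrievers rank highly rise to the top, and chunks found by only one retriever still survive lower down.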

System	Answerer	QA Accuracy	Cost / 1K sessions
Feather DB	GPT-4o	0.693	$7.50
Full-context GPT-4o (paper baseline)	GPT-4o	0.640	$288.00
Feather DB	Gemini-2.5-Flash	0.657	~$2.40
Zep (graphiti)	GPT-4o	0.712	—

The number that matters most here is the comparison to full-context GPT-4o. The paper's baseline feeds the entire conversation history — every session, every turn — into GPT-4o's context window at query time. That is the naive approach, and it scores 0.640.

Feather DB retrieves a small, focused slice of that history and scores 0.693. That is a 5.3 percentage point improvement at roughly 38× lower cost ($7.50 versus $288.00 per 1,000 sessions).
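
Both headline figures follow directly from the table rows, as this short check of the arithmetic shows (variable names are illustrative):

```python
# Values copied from the results table above.
feather_acc, full_ctx_acc = 0.693, 0.640
feather_cost, full_ctx_cost = 7.50, 288.00  # USD per 1,000 sessions

gain_pp = (feather_acc - full_ctx_acc) * 100  # accuracy gain in percentage points
cost_ratio = full_ctx_cost / feather_cost     # how many times cheaper
```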

Why focused retrieval beats full context

The intuition is straightforward. When you pass 90 days of conversation into a context window, the model's attention is split across hundreds of facts, conversations, and topics. The relevant fact has to compete with everything else in that window.

When you retrieve a focused set of chunks — the three or four most relevant memory segments — the model's attention goes where it needs to go. Less noise, more signal.

Feather DB adds one more layer on top of retrieval: adaptive decay scoring. Facts that have been retrieved often and recently are weighted higher than facts that have been dormant. The half_life parameter (default: 14 days for agent memory workloads) controls how quickly old facts fade from the top of scored results. This matters for the knowledge-update category: if a user corrected a preference last week, that update should outrank the original preference from two months ago.

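import feather_db

# Assumes `db` is an already-open Feather DB handle and `query_vec` is the
# embedding of the query text; neither is created in this snippet.
# half_life is expressed in days.
cfg = feather_db.ScoringConfig(half_life=14.0, time_weight=0.4, min=0.0)
results = db.search(query_vec, k=5, scoring=cfg)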
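
This post does not spell out the exact scoring formula, so the following is only a sketch of how a half-life decay can be blended with similarity. The blend, the `decayed_score` name, and the interpretation of `time_weight` and the floor are assumptions, and the sketch models only age, not retrieval frequency.

```python
def decayed_score(similarity, age_days, half_life=14.0, time_weight=0.4, min_weight=0.0):
    """Blend similarity with a recency term that halves every `half_life` days."""
    recency = max(0.5 ** (age_days / half_life), min_weight)  # 1.0 when brand new
    return (1.0 - time_weight) * similarity + time_weight * recency

# A two-month-old preference versus a correction from last week (toy similarities).
original = decayed_score(similarity=0.82, age_days=60)
correction = decayed_score(similarity=0.80, age_days=7)
```

Under these assumptions the week-old correction outranks the older fact even though its raw similarity is slightly lower, which is the behavior the knowledge-update category rewards.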

What the score of 0.693 means in practice

A few grounding points for interpreting this number:

69.3% is not perfect. Roughly 1 in 3 questions gets a wrong or incomplete answer. The benchmark is hard — temporal reasoning and knowledge updates are genuinely difficult for retrieval-based systems.
It beats the full-context baseline. Full-context GPT-4o, the approach where you just throw everything at the model and hope, scores 0.640. A system that stores and retrieves selectively outperforms it on a demanding long-term memory benchmark.
The cost gap is the real story. $7.50 vs $288 per 1,000 sessions is not a marginal difference. At scale, that is the difference between a memory feature that is economically viable and one that is not.
Gemini-Flash gets you to 0.657 at $2.40 per 1,000 sessions. For latency-tolerant pipelines or cost-constrained applications, that is a strong option.

What to watch when you run it yourself

The benchmark harness is included in the feather-db package. When you run it against your own memory configuration, these are the numbers worth tracking:

Metric	What it tells you
recall@1	How often the best chunk is ranked first — critical if you only pass k=1 to the LLM
recall@5	Whether evidence is being found at a practical context budget
recall@10	Upper bound on what your retriever can possibly deliver
QA accuracy	End-to-end correctness — the number that matters to users
Retrieval latency (p50, p99)	Whether the memory layer adds meaningful latency to the response path
Cost per 1K sessions	Whether the approach scales economically

A gap between recall@10 and QA accuracy is usually an answerer problem — the evidence is being retrieved but the LLM is not using it correctly. A gap between recall@1 and recall@10 is a ranking problem — the right chunk is in the index but not being surfaced first. Both failures look identical in QA accuracy, but they have different fixes.
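
The two gaps can be checked mechanically. This is a rough sketch: the `diagnose` helper and its 0.10 threshold are hypothetical choices, not something the harness provides.

```python
def diagnose(recall_at_1, recall_at_10, qa_accuracy, gap=0.10):
    """Flag which layer to look at first, based on the two gaps described above."""
    findings = []
    if recall_at_10 - recall_at_1 > gap:
        findings.append("ranking")    # evidence is found but not ranked first
    if recall_at_10 - qa_accuracy > gap:
        findings.append("answerer")   # evidence is found but not used correctly
    return findings or ["no large gap"]

# Illustrative numbers, not measured results.
weak_ranking = diagnose(recall_at_1=0.60, recall_at_10=0.95, qa_accuracy=0.90)
weak_answerer = diagnose(recall_at_1=0.90, recall_at_10=0.95, qa_accuracy=0.60)
```

Both metrics should come from the same retrieval configuration. Comparing recall from one retriever against QA accuracy from another says little about where the loss occurs.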

Running the benchmark

The full harness ships with feather-db. Two commands cover both evaluation modes:

# Retrieval harness — BM25, no API key required
python -m feather_db.bench.longmemeval --harness retrieval --retriever bm25

# QA accuracy harness — requires OPENAI_API_KEY or GOOGLE_API_KEY
python -m feather_db.bench.longmemeval --harness qa --answerer gpt-4o

Results are written to bench/results/longmemeval_{timestamp}.json alongside the raw per-question scores. Every published number in this post was generated from those JSON files — the audit trail is reproducible.

If you are evaluating a custom memory configuration — different chunking strategy, different half-life, different k — the harness accepts flags for all of those. Run it against your own setup before committing to a production configuration.

What we are working on next

The current weak spots are temporal reasoning (0.417–0.477) and the knowledge-update category, where scores plateau at 0.714 regardless of which model is used as the answerer. Both suggest the problem is in the retrieval and scoring layer, not the LLM. We are exploring explicit temporal indexing and conflict-aware update handling as next steps.

LongMemEval scores are reported in our benchmark documentation and updated with each major release.
