Benchmarked · Reproducible · Open

Beats full-context GPT-4o.

Feather DB v0.8.0 with GPT-4o as the answerer scores 0.693 on LongMemEval_S, beating the paper's full-context GPT-4o ceiling of 0.640. The cheap tier with Gemini-Flash hits 0.657 at ~$2.40 per full benchmark run.

0.693
LongMemEval_S
GPT-4o answerer
0.657
LongMemEval_S
Gemini-Flash · ~$2.40/run
0.19ms
p50 latency
500K × 128-dim · ef=50
0.972
recall@10
SIFT1M · ef=50
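For context on the recall@10 number above: it is the fraction of each query's true 10 nearest neighbors (from exact search) that the approximate index returns, averaged over all queries. A minimal sketch of that metric in plain Python (illustrative only, not the Feather DB API):

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Mean over queries of |top-k approx ∩ top-k exact| / k.

    approx_ids / exact_ids: per-query lists of neighbor ids,
    already sorted by distance (nearest first).
    """
    hits = [len(set(a[:k]) & set(e[:k])) for a, e in zip(approx_ids, exact_ids)]
    return sum(hits) / (k * len(hits))

# Toy example: 5 of the 10 true neighbors were retrieved -> recall 0.5
approx = [[0, 1, 2, 3, 4, 10, 11, 12, 13, 14]]
exact = [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
print(recall_at_k(approx, exact))  # 0.5
```

On SIFT1M the exact top-10 lists come with the dataset's ground-truth file, so this reduces to a set intersection per query.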

LongMemEval_S leaderboard

500 questions · ~115K-token haystack each · 5-axis scoring
Naive vector RAG (paper baseline) · 0.310
Full-context GPT-4o-mini (paper) · 0.554
Full-context GPT-4o (paper · the bar to beat) · 0.640
Feather DB + Gemini-Flash (cheap tier · ~$2.40 per full run) · 0.657
Feather DB + GPT-4o (~$8 per full run) · 0.693
Per-axis breakdown (GPT-4o)
single-session-user 1.000
single-session-assistant 0.964
preference 0.767
knowledge-update 0.714
multi-session 0.606
temporal 0.477
reproduce locally · ~8 min
pip install feather-db
git clone https://github.com/feather-store/feather && cd feather
python -m bench run longmemeval --dataset s --limit 0 \
    --embedder openai \
    --answerer-provider gemini --answerer-model gemini-2.5-flash \
    --decay-half-life 14 --decay-time-weight 0.4 --k 10
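The `--decay-half-life 14` and `--decay-time-weight 0.4` flags suggest recency-weighted ranking: the recency signal halves every 14 days and is blended with vector similarity at weight 0.4. One plausible reading of that scoring, sketched in Python (an assumption for illustration, not Feather DB's actual implementation; `decayed_score` is a hypothetical name):

```python
import math

def decayed_score(similarity, age_days, half_life=14.0, time_weight=0.4):
    """Blend similarity with an exponentially decaying recency term.

    recency halves every `half_life` days; `time_weight` controls how much
    recency contributes relative to raw vector similarity.
    """
    recency = math.exp(-math.log(2) * age_days / half_life)
    return (1.0 - time_weight) * similarity + time_weight * recency

# A fresh memory (age 0) gets the full recency bonus;
# a 14-day-old one gets exactly half of it.
print(decayed_score(0.8, 0.0))   # 0.88
print(decayed_score(0.8, 14.0))  # 0.68
```

Under this reading, a slightly less similar but recent memory can outrank an older near-duplicate, which matters for the knowledge-update and temporal axes above.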