
Why RAG Stops Working After 90 Days (and What a Living Engine Does Instead)

Every production RAG pipeline degrades on the same schedule. Around the 90-day mark, retrieval quality collapses for predictable structural reasons. This post traces the failure mode and the design moves that prevent it.

Feather DB Engineering · Engineering Team

Theory · Living Context Engine Series · 13 min read · May 14, 2026


The 90-Day Curve

Anyone who has run a production RAG pipeline at scale has seen this curve. Quality is excellent in week one. Steady through month one. Uneven by month two. By month three, output quality has visibly degraded — the same queries that used to return crisp, on-point results now surface a mix of stale documents, irrelevant chunks, and the right answer buried at rank seven.

The temptation is to blame the embedding model, the chunking strategy, or the LLM. Those are downstream symptoms. The actual cause is upstream and structural: standard RAG pipelines rest on four design assumptions, and each one fails around month three.

Failure Mode 1: The Corpus Grows, the Index Doesn't Forget

Every new document added increases the size of the candidate set but does not displace older documents. After 90 days of active ingestion, your top-k results are competing against 10x more candidates than they were on day one. A query that used to surface the right brief at rank 2 now surfaces it at rank 6 — same similarity score, more competition.

The Living Engine fix: decay-weighted scoring. Old documents are not deleted, but their effective rank drops over time unless they continue to be recalled. The hot working set stays small even as the cold archive grows.
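To make that concrete, here is a minimal sketch of decay-weighted scoring. The exponential decay curve, the 30-day half-life, and the idea of refreshing a timestamp on recall are assumptions made for this example, not Feather DB internals.

```python
import time

# Illustrative constant: a node's weight halves every 30 days without a
# recall. The value is an assumption for this sketch, not a real default.
HALF_LIFE_DAYS = 30.0

def decay_weight(last_recalled_ts: float, now: float) -> float:
    """Exponential decay driven by time since the node was last recalled."""
    age_days = (now - last_recalled_ts) / 86_400
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def effective_score(similarity: float, last_recalled_ts: float) -> float:
    """Raw similarity scaled by decay: stale nodes sink, recalled nodes stay hot."""
    return similarity * decay_weight(last_recalled_ts, time.time())
```

Because every recall refreshes the timestamp, a ten-times-larger cold archive stops diluting the hot working set: archived entries simply fall out of contention once their weight decays below the scores of actively recalled nodes.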

Failure Mode 2: Chunk Drift

RAG pipelines chunk documents at ingestion time. The chunking strategy was tuned in week one against a specific corpus distribution. By month three, the distribution has shifted — longer documents, more code blocks, more multilingual content — and the original chunk size produces fragments that no longer correspond to coherent thoughts.

The Living Engine fix: semantic-unit storage. Store thoughts as nodes, not pages as chunks. A node is a complete idea (a brief paragraph, a strategy bullet, a metric definition), sized to what a human would consider one piece of context. Re-chunking pages then becomes unnecessary: when the distribution shifts, you re-index thoughts.
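One hypothetical shape such a node might take is sketched below; every field name here is an illustration, not a documented schema.

```python
from dataclasses import dataclass, field

@dataclass
class ContextNode:
    """One complete idea, stored and indexed as a unit (hypothetical shape)."""
    node_id: str
    text: str                      # one coherent, human-sized thought
    kind: str                      # e.g. "brief_paragraph", "metric_definition"
    embedding: list[float]         # embeds the thought, not an arbitrary page slice
    recall_count: int = 0          # used by the feedback loop below
    last_recalled_ts: float = 0.0  # drives the decay weight above
    edges: dict[str, list[str]] = field(default_factory=dict)  # edge type -> node ids
```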

Failure Mode 3: Relationship Loss

Every chunk in a RAG corpus is an island. The chunk knows nothing about the document it came from, the campaign that referenced it, the post-mortem that responded to it. By month three, the corpus has accumulated thousands of related-but-disconnected fragments, and the retrieval layer cannot surface the connections.

The Living Engine fix: typed graph edges. Every node has typed outgoing edges to related nodes. Retrieval is a fused vector + graph operation that returns a connected subgraph, not an unordered list. The relationships that disappear in a chunked corpus are preserved as first-class structure.
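Continuing the ContextNode sketch, fused retrieval might seed with a decay-reranked vector search and then expand one hop along typed edges, returning a connected subgraph rather than a flat hit list. The vector_search callable, the store dict, and the edge-type names are assumed interfaces for this example.

```python
def retrieve_subgraph(query_vec, store, vector_search, k=5,
                      edge_types=("cites", "responds_to")):
    # Over-fetch, then rerank by decay-weighted score (effective_score above).
    candidates = vector_search(query_vec, k=3 * k)  # -> [(node_id, similarity), ...]
    seeds = sorted(
        ((nid, effective_score(sim, store[nid].last_recalled_ts))
         for nid, sim in candidates),
        key=lambda pair: pair[1], reverse=True,
    )[:k]

    # Expand one hop along typed edges so related nodes travel together.
    subgraph = {}
    for node_id, _score in seeds:
        subgraph[node_id] = store[node_id]
        for edge_type in edge_types:
            for neighbor_id in store[node_id].edges.get(edge_type, []):
                subgraph.setdefault(neighbor_id, store[neighbor_id])
    return subgraph
```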

Failure Mode 4: No Feedback Loop

The most damaging structural absence in a standard RAG pipeline is that retrieved chunks do not learn from being used. An agent that uses a chunk well, or quotes it in a successful output, has no way to signal that fact back to the index. By month three, the index has no information about which of its entries are load-bearing and which are dead weight.

The Living Engine fix: closed feedback loop. Successful retrievals increment recall counters. Agent outputs are written back as new context nodes with edges to their inputs. The system gets quieter where it doesn't matter and louder where it does.
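A sketch of the write-back step, reusing the node shape above: bump recall counters on the nodes an agent actually used, refresh their decay timestamps, then store the output as a new node with edges back to its inputs. The "derived_from" edge type and the embed callable are assumptions for illustration.

```python
import time
import uuid

def record_use(store, used_node_ids, output_text, embed):
    now = time.time()
    for node_id in used_node_ids:
        node = store[node_id]
        node.recall_count += 1       # this entry proved load-bearing
        node.last_recalled_ts = now  # refreshes its decay weight
    # Write the output back as a first-class context node.
    output_node = ContextNode(
        node_id=str(uuid.uuid4()),
        text=output_text,
        kind="agent_output",
        embedding=embed(output_text),
        last_recalled_ts=now,
        edges={"derived_from": list(used_node_ids)},
    )
    store[output_node.node_id] = output_node
    return output_node
```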

The Composite Cure

Each fix above is useful alone. Together, they invert the 90-day degradation curve into a 90-day improvement curve. Frequently used context becomes more present. The corpus self-curates. Cross-document relationships are first-class. The system carries memory of what worked.

The architectural diagram is identical to standard RAG — encode, store, retrieve, generate — but each box has different semantics. Storage is a unified node store with decay state. Retrieval is a fused vector + graph traversal. Output is written back as new context. The plumbing is the same; the behavior is fundamentally different.
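Tying the sketches together, the loop under those four boxes might read as follows; generate and embed stand in for a real LLM and encoder, and everything else reuses the hypothetical pieces above.

```python
def answer(query, store, vector_search, embed, generate):
    query_vec = embed(query)                                       # encode
    subgraph = retrieve_subgraph(query_vec, store, vector_search)  # retrieve (fused)
    context = "\n\n".join(node.text for node in subgraph.values())
    output = generate(query, context)                              # generate
    record_use(store, list(subgraph), output, embed)               # write back
    return output
```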

What This Means in Practice

A team running a static RAG pipeline for an internal AI assistant will eventually have to choose between three bad options:

  1. Rebuild the corpus periodically. Expensive, manual, and the new corpus has the same flaw on a delay.
  2. Add filtering heuristics. Each heuristic helps once and rots.
  3. Accept the quality ceiling. The most common outcome — and the reason "the AI feels generic" is a near-universal complaint.

The Living Context Engine option is a fourth path: replace the index with a memory, and let the memory be the curation layer. The first three options are workarounds for the structural absence. The fourth is the structural fix.


Part of the Living Context Engine series. Next: The Context Half-Life Problem.