Context Rot: How increasing input tokens impacts LLM performance
38 points | kellyhongsn | 8 comments | 7/14/2025, 7:25:15 PM | research.trychroma.com
I work on research at Chroma, and I just published our latest technical report on context rot.
TLDR: Model performance is non-uniform across context lengths, even for state-of-the-art models such as GPT-4.1, Claude 4, Gemini 2.5, and Qwen3.
This highlights the need for context engineering. Simply having relevant information in a model's context is not enough; how that information is presented matters even more.
Here is the complete open-source codebase to replicate our results: https://github.com/chroma-core/context-rot
I've seen this especially with Gemini Pro when providing long-form textual references: putting many documents into a single context window gives worse answers than having the model summarize each document first, asking questions against the summaries only, and then providing the full text of specific sub-documents on request (RAG-style, or just a simple agent loop).
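A rough sketch of that staged pattern, for the curious (the llm() helper and the prompts are placeholders, not any particular SDK):

    # Minimal sketch of the summarize-first loop described above.
    # llm() is a placeholder for whatever chat-completion call you use.
    def llm(prompt: str) -> str:
        raise NotImplementedError("swap in your Gemini/Claude/etc. client")

    def answer_with_staged_context(question: str, documents: dict[str, str]) -> str:
        # Stage 1: summarize each document independently, keeping each call's context short.
        summaries = {
            doc_id: llm(f"Summarize this document in a few sentences:\n\n{text}")
            for doc_id, text in documents.items()
        }

        # Stage 2: show the model only the summaries and ask which documents it needs.
        catalog = "\n".join(f"[{doc_id}] {s}" for doc_id, s in summaries.items())
        wanted = llm(
            "Given these document summaries, reply with the comma-separated ids "
            f"needed to answer the question.\n\nQuestion: {question}\n\n{catalog}"
        )
        selected = [i.strip() for i in wanted.split(",") if i.strip() in documents]

        # Stage 3: answer from the full text of only the requested documents.
        context = "\n\n".join(documents[doc_id] for doc_id in selected)
        return llm(f"Answer using only this context:\n\n{context}\n\nQuestion: {question}")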
Similarly, I've personally noticed that Claude Code with Opus or Sonnet gets worse the more compactions happen. It's unclear to me whether the summary itself degrades, or whether the context window ends up with a higher proportion of less relevant data, but even clearing the context and asking it to re-read the relevant files (even ones that were mentioned and summarized in the compaction) gives better results.
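Concretely, the reset looks something like this in Claude Code (the file names are made up for illustration):

    /clear
    > Re-read src/session.py and src/auth.py, then continue fixing the token-refresh bug.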
Long story short: context engineering is still king, and RAG is not dead.
Media literacy disclaimer: Chroma is a vector DB company.