I noticed something interesting this week while running Claude MPM over Claude Code. When Claude Code reported having 10% context remaining until auto-compact, my PM agent (which independently monitors session state) showed something different: Used 128k/200k tokens—only 64% of available context.
That gap between the two readings got me thinking. What if Claude Code’s recent performance improvements aren’t primarily about better code generation or smarter prompting? What if they stem from something more fundamental: reserving more free context space to maintain reasoning quality?
Here’s my working hypothesis: Claude Code has been progressively pushing its auto-compact threshold down—stopping earlier to preserve more working memory. In the old days (just several weeks ago), Claude Code would run until it couldn’t, sometimes failing to compact because it didn’t have enough free space left. Now it appears to be stopping much earlier, maintaining substantial breathing room for the LLM to actually think.
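A quick sanity check makes those two readings less contradictory than they look, if you assume “remaining until auto-compact” is measured against an internal compaction threshold rather than the full window. The threshold below is my guess, not a documented value:

```python
# Back-of-the-envelope reconciliation of the two readings.
# Assumption (not documented): "remaining until auto-compact" is measured
# against an internal compaction threshold, not against the full window.

WINDOW = 200_000   # total context window, in tokens
USED = 128_000     # what Claude MPM reported (64% of the window)

# Hypothetical threshold: auto-compact fires at ~75% of the window.
compact_threshold = int(WINDOW * 0.75)                # 150,000 tokens

remaining_until_compact = compact_threshold - USED    # 22,000 tokens
pct_of_window = remaining_until_compact / WINDOW      # ~11%

print(f"Used: {USED / WINDOW:.0%} of the window")            # 64%
print(f"Remaining until auto-compact: {pct_of_window:.0%}")  # ~11%
```

With a threshold near 75%, “64% used” and “roughly 10% until auto-compact” describe the same session.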
And if that’s true, it demonstrates something I’ve been writing about for months: infrastructure matters more than features. Sometimes improving tools means constraining them more intelligently rather than pushing them harder.
TL;DR
• Working hypothesis: Claude Code triggers auto-compact much earlier than before—potentially around 64-75% context usage vs. historical 90%+
• Engineers appear to have built in a “completion buffer” giving tasks room to finish before compaction, eliminating disruptive mid-operation interruptions
• More free context enables better LLM reasoning—research and developer experience show performance degrades significantly as context windows fill
• Anthropic’s recent context management features (context editing, memory tool) enable this more conservative approach
• This represents the “infrastructure over features” paradigm—better performance through smarter resource management rather than maximizing utilization
• Community reports and GitHub issues document both auto-compact behavior changes and corresponding Claude Code performance improvements
• Key insight: sometimes improving AI tools means accepting what looks like inefficiency to maintain quality where it matters
Why Free Context Matters for Reasoning Quality
LLMs need working memory to reason effectively. When Claude processes information, it’s not just reading what’s in the context window—it’s actively using that space to develop responses, evaluate options, and construct output. As the context window fills, available working memory shrinks.
Research consistently shows that “optimizing Claude’s context window in 2025 involves context quality over quantity,” with performance degrading substantially as models approach their limits. The technical mechanism is straightforward: when most context space is consumed by conversation history, file contents, and tool outputs, the model has minimal room for the computational processes that produce high-quality responses.
Think of it like RAM on your computer. Sure, you can run programs until you hit 95% memory utilization. But that last 5% gets consumed by swapping, garbage collection, and system overhead—leaving nothing for actual computation. Your programs slow to a crawl despite having “only” 95% utilization.
LLMs work similarly. That “free” context space isn’t wasted—it’s where reasoning happens. As Claude Code approaches its 200k-token limit, it’s not the reading that becomes problematic; it’s the writing. The model needs space to construct responses, evaluate code changes, and plan multi-step operations.
The Historical Context Collapse Problem
Several weeks ago, Claude Code would frequently run sessions until context collapse became inevitable. Auto-compact was designed to “automatically summarize conversations when approaching memory limits,” but the system often triggered too late—sometimes lacking sufficient space to even perform the compaction process itself.
The pattern was frustrating: you’d be deep into a complex refactoring, making steady progress, then suddenly Claude Code would struggle. Responses would become generic, previous decisions would be forgotten, and code quality would noticeably degrade. Developers noted that “LLMs perform much worse when the context window approaches its limit,” describing how context becomes “poisoned pretty easily” during long sessions. I’ve experienced this firsthand—watching a productive session gradually deteriorate as context filled up, with the model starting to contradict earlier decisions or forget project-specific patterns it had been following consistently.
The GitHub issues tell this story. One critical bug report documented auto-compact triggering at 8-12% remaining context “instead of 95%+, causing constant interruptions every few minutes”. Another described context management becoming “permanently corrupted” after failed compaction attempts, with the system stuck showing “102%” context usage and entering infinite compaction loops.
The frequency of these reports—with issues receiving dozens of “+1” reactions and multiple developers describing identical symptoms—suggests widespread problems rather than isolated incidents. These weren’t edge cases; they were symptoms of a fundamental tension: maximizing context utilization vs. maintaining reasoning quality.
Anthropic’s Context Management Evolution
The turning point came with Anthropic’s September 2025 announcement of new context management capabilities. The introduction of “context editing” and the “memory tool” represented a systematic approach to solving the context exhaustion problem, with context editing automatically clearing stale tool calls while preserving conversation flow.
The technical implementation reveals the strategic shift. In a 100-turn web search evaluation, context editing enabled agents to complete workflows that would otherwise fail due to context exhaustion—while reducing token consumption by 84%. This reflects a significant architectural shift in how Anthropic approaches context management.
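Anthropic hasn’t published the implementation details, but the core move, dropping stale tool output while keeping the conversational thread intact, is easy to sketch. This is a conceptual illustration in plain Python, not Anthropic’s API; the message shape and the keep_last parameter are mine:

```python
def clear_stale_tool_results(messages, keep_last=3, placeholder="[tool output cleared]"):
    """Conceptual sketch of context editing: replace the bodies of older
    tool-result messages with a short placeholder, preserving the most
    recent ones and the surrounding conversation flow.

    Assumes each message is a dict like {"role": ..., "type": ..., "content": ...};
    this shape is illustrative, not Anthropic's actual wire format.
    """
    tool_result_indices = [
        i for i, m in enumerate(messages) if m.get("type") == "tool_result"
    ]
    stale = set(tool_result_indices[:-keep_last]) if keep_last else set(tool_result_indices)

    edited = []
    for i, m in enumerate(messages):
        if i in stale:
            edited.append({**m, "content": placeholder})  # keep the turn, drop the bulk
        else:
            edited.append(m)
    return edited
```

The point of the sketch: the conversation’s structure survives, but the tokens tied up in old tool output are reclaimed for reasoning.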
But the most telling detail appears in Anthropic’s evaluation metrics. Combining the memory tool with context editing improved performance by 39% over baseline, with context editing alone delivering a 29% improvement. These gains come from better context management, not better code generation models.
The documentation now explicitly recommends practices that would have been heretical months ago. Anthropic’s best practices guide suggests “using subagents to verify details or investigate particular questions, especially early on in a conversation or task, tends to preserve context availability without much downside in terms of lost efficiency”. Translation: delegate and distribute context load rather than cramming everything into one session.
Community Observations Align With Conservative Thresholds
The community has noticed something changed, even if they can’t pinpoint exactly what. Best practices guides now emphasize that “auto-compact is a feature that quietly consumes a massive amount of your context window before you even start coding,” with some reports showing the autocompact buffer consuming “45k tokens—22.5% of your context window gone before writing a single line of code”.
But the more interesting observation comes from debugging discussions. A detailed feature request noted that “the VSCode extension currently auto-compacts at ~25% remaining context (75% usage), reserving ~20% for the compaction process itself”. That aligns remarkably well with my Claude MPM observation showing 64% usage when Claude Code reported 10% until auto-compact.
If Claude Code is indeed triggering compaction at 75% utilization rather than 90%+, that leaves 25% of the context window (50k tokens in a 200k window) free for reasoning. That’s substantial working memory—enough space for the model to effectively plan, evaluate alternatives, and construct high-quality responses.
The performance impact shows up in usage patterns. While some users report “performance complaints, context limitations, and inconsistent outputs,” others note that “context window management creates perceived inconsistencies” and “regular history pruning and strategic context management often restore expected performance levels”.
The Completion Buffer: Room to Finish What You Started
Here’s another subtle improvement I’ve noticed: Claude Code now seems to have more wiggle room to complete tasks before triggering auto-compact. In the old days, you’d often hit compaction mid-operation—halfway through a refactoring, in the middle of implementing a feature, right when you needed the model to maintain full context.
This suggests the engineers built in a completion buffer—enough free space not just for the compaction process itself, but to allow the current task to finish gracefully. It’s the difference between:
Old behavior: Hit 90% context → Start new task → Run out of space mid-task → Force compact → Lose context about what you were doing
New behavior: Hit 75% context → Plenty of room for current task → Complete it successfully → Then compact with full understanding of what was accomplished
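If that’s roughly what’s happening under the hood, the decision amounts to a small check before each task: will this fit in the room left before the threshold, or should we compact now, while the model still has full context of what it just finished? A sketch, where every number and name is my own guess rather than anything from Claude Code’s internals:

```python
WINDOW = 200_000
COMPACT_THRESHOLD = int(WINDOW * 0.75)   # hypothetical: target ~75% usage before compacting

def should_compact_first(used_tokens: int, task_estimate_tokens: int) -> bool:
    """Hypothesized 'completion buffer' check: if the upcoming task won't fit
    inside the room left before the compaction threshold, compact now rather
    than being forced to compact mid-task."""
    room_left = COMPACT_THRESHOLD - used_tokens
    return task_estimate_tokens > room_left

# 130k tokens used, ~15k-token task ahead: it fits, so finish the task first.
print(should_compact_first(130_000, 15_000))   # False
# Same usage, but a ~30k-token refactor ahead: compact before starting it.
print(should_compact_first(130_000, 30_000))   # True
```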
This isn’t just about when compaction triggers, but about giving the system enough runway to land the plane before resetting. The user experience difference is substantial. Instead of constantly fighting interrupted workflows, you now get clean task completion followed by reset.
I’ve noticed this directly: sessions that would previously hit compaction mid-refactoring now complete the refactoring cleanly, then compact. The model maintains full context about what it’s changing and why through to completion, rather than losing thread halfway through and having to reconstruct understanding from a summary.
That completion buffer—the gap between “starting to approach limits” and “actually hitting limits”—transforms context management from reactive crisis mode to proactive workflow optimization. You’re not scrambling to salvage a half-finished refactoring; you’re finishing work cleanly, then resetting for the next phase.
It’s infrastructure thinking applied to user experience: the best system management is invisible to users because it prevents problems rather than recovering from them.
The Auto-Compact Debate Reveals the Trade-off
The community remains divided on auto-compact itself, but that debate illuminates the fundamental tension. Some developers argue for disabling auto-compact entirely, noting “we already have better solutions for maintaining context across sessions: CLAUDE.md files capture your project’s patterns and standards, custom commands encode repetitive workflows”.
Others recognize the necessity but want control over when it happens. The core complaint: “when a task is 90% done, forced compaction wastes tokens and disrupts flow,” with users requesting manual control rather than automatic triggering.
What both camps agree on: auto-compact triggering “when the context window reaches approximately 95% capacity” is problematic, with users consistently “advising against waiting for auto-compact, as it can sometimes take a while”.
The resolution? Trigger earlier, preserve more working memory, and give the model room to think before hitting crisis mode.
Technical Explanation: Why Earlier Compaction Works
The counter-intuitive insight: stopping earlier actually extends productive session length. Here’s why.
When Claude Code runs until 90% context utilization before compacting:
Context window: 200k tokens total
Conversation + files + tools: 180k tokens
Free space for reasoning: 20k tokens
Compaction process overhead: 15-20k tokens
Result: Barely enough space to compact, frequent failures, degraded quality
When Claude Code stops at 75% context utilization:
Context window: 200k tokens total
Conversation + files + tools: 150k tokens
Free space for reasoning: 50k tokens
Compaction process overhead: 15-20k tokens
Result: Comfortable margins, successful compaction, sustained quality
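Here’s the same arithmetic as a runnable sketch, using the figures above (the 15-20k compaction overhead is my estimate, not a published number):

```python
WINDOW = 200_000
COMPACTION_OVERHEAD = 20_000   # upper end of the 15-20k estimate above

def margins(threshold_pct: float):
    """Free space and post-compaction headroom for a given compact threshold."""
    used = int(WINDOW * threshold_pct)
    free = WINDOW - used
    headroom_after_compaction = free - COMPACTION_OVERHEAD
    return used, free, headroom_after_compaction

for pct in (0.90, 0.75):
    used, free, headroom = margins(pct)
    print(f"compact at {pct:.0%}: {used:,} used, {free:,} free, "
          f"{headroom:,} left after compaction overhead")
# compact at 90%: 180,000 used, 20,000 free, 0 left after compaction overhead
# compact at 75%: 150,000 used, 50,000 free, 30,000 left after compaction overhead
```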
The numbers tell the story, but the user experience is what matters. By stopping earlier, Claude Code actually enables longer effective sessions because each turn maintains higher reasoning quality. What feels like “wasted” context capacity—that unused 25%—turns out to be critical for maintaining the clarity and consistency that makes the utilized portion valuable. This aligns with the principle that “effective management isn’t just a nice-to-have—it’s essential for sustaining coherent, multi-turn conversations without the AI losing thread”.
The “Doing Less” Paradigm
This represents a fundamental shift in how we think about AI tool optimization. Developer instinct says maximize utilization—use every available token, run until you hit limits, squeeze maximum value from expensive compute resources. It feels wasteful to leave 50k tokens “unused.”
But that’s optimizing the wrong metric. The goal isn’t maximum context utilization; it’s maximum productive output. As Anthropic’s engineering team notes, “managing context in Claude Code is now a multi-dimensional problem: model choice, subagent design, CLAUDE.md discipline, thinking budgets, and tooling architecture all interact”.
I’ve watched this play out in my own sessions. Running Claude Code until it approaches 90% utilization produces more code per session in terms of raw output. But the quality deteriorates—more bugs slip through, architectural decisions become inconsistent, earlier project-specific patterns get forgotten. Sessions that stop at 75% utilization produce less total output but higher-quality, more maintainable code that actually ships.
The performance gains from conservative context management show up across multiple dimensions:
Response Quality: More working memory enables better reasoning about complex refactoring, architectural decisions, and edge cases.
Session Reliability: Earlier compaction prevents the context corruption loops that plagued previous versions.
Cognitive Load: Developers spend less time fighting context management issues and more time building features.
Cost Efficiency: Paradoxically, stopping earlier may reduce overall token consumption through fewer failed compaction attempts and fewer sessions requiring complete restarts.
Practical Implications for Developers
If this hypothesis is correct—that Claude Code performance improvements stem largely from more conservative context management—what should developers do?
1. Stop Fighting Auto-Compact
The old advice was to disable auto-compact and manage context manually. But Anthropic’s engineering guidance now leans the other way: “do the simplest thing that works will likely remain our best advice for teams building agents on top of Claude”. If Claude Code is now triggering compaction at reasonable thresholds, let it work.
2. Use CLAUDE.md for Persistent Context
Rather than cramming everything into conversation history, “use a dedicated context file (like CLAUDE.md) to inject fundamental requirements every session. This is where core app features, tech stacks, and ‘never-forgotten’ project notes live”. This moves stable information out of the limited conversation window.
3. Leverage Subagents for Task Isolation
The best practice is to “divide and conquer with sub-agents: modularize large objectives. Delegate API research, security review, or feature planning to specialized sub-agents”. Each subagent gets its own context window, preventing any single session from approaching limits.
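The mechanics are easy to picture: each subagent works from its own near-empty context, and only a condensed result flows back to the parent session. A rough sketch; run_subagent and the example subtasks are stand-ins for whatever orchestration layer you use (a separate Claude Code session, the Task tool, or a Claude MPM agent):

```python
def run_subagent(task: str) -> str:
    """Stand-in for delegating a task to an isolated session with its own
    context window. Here it just returns a placeholder summary so the
    sketch runs end to end."""
    return f"[condensed findings for: {task}]"

parent_context: list[str] = []   # what the main session actually has to carry

subtasks = [
    "Research the external API's auth flow and rate limits",
    "Review the diff in src/payments/ for security issues",
    "Draft a migration plan for the new schema",
]

for task in subtasks:
    summary = run_subagent(task)     # heavy exploration happens in the isolated context
    parent_context.append(summary)   # only the distilled result lands in the parent session

print("\n".join(parent_context))
```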
4. Monitor But Don’t Micromanage
Tools like Claude MPM provide visibility into actual context usage, helping you understand when sessions are approaching limits. But the key is knowing “when things get weird: is Claude getting confused or stuck in a loop? Don’t argue with it. Use /clear to reset its brain & start fresh”.
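If you want a rough gauge without extra tooling, the API itself reports per-request token usage, and since input tokens cover the full prompt for that turn, they approximate how much of the window the session currently occupies. A minimal sketch using the Anthropic Python SDK; the 75% watch threshold is this article’s hypothesis, not an official limit:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

WINDOW = 200_000
WATCH_THRESHOLD = 0.75   # the hypothesized "start wrapping up" point, not an official limit

response = client.messages.create(
    model="claude-sonnet-4-5",   # substitute whichever model you're actually running
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize the open TODOs in this repo."}],
)

# input_tokens covers the whole prompt for this turn, so it approximates
# how much of the context window the session currently consumes.
used = response.usage.input_tokens + response.usage.output_tokens
print(f"~{used / WINDOW:.0%} of a {WINDOW:,}-token window used on this turn")

if used / WINDOW > WATCH_THRESHOLD:
    print("Approaching the hypothesized compaction zone: finish the current task, then compact or /clear.")
```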
5. Accept That Less Can Be More
The hardest lesson: sometimes the path to better performance is artificial constraints. Stopping at 75% utilization feels wasteful—you’re leaving 50k tokens “unused.” But that free space enables the reasoning quality that makes the utilized tokens valuable.
Infrastructure Over Features, Again
This observation fits a pattern I’ve been documenting for months: the most important advances in AI development tools aren’t necessarily flashier features or bigger models. They’re better infrastructure.
Claude Code’s performance gains likely stem more from smarter context management, better memory systems, and more conservative resource utilization than from model improvements alone (though Anthropic did make Sonnet 4.5 substantially better at code generation).
Anthropic’s position is clear: “waiting for larger context windows might seem like an obvious tactic. But it’s likely that for the foreseeable future, context windows of all sizes will be subject to context pollution and information relevance concerns”. The solution isn’t more capacity; it’s better management of existing capacity.
This mirrors broader patterns in the AI tools market. Look at which tools developers stick with versus which ones generate initial excitement then fade. Cursor grabbed attention with its aggressive feature velocity, but many developers report returning to Claude Code for long-form work—not because of flashier features, but because sessions remain productive longer. The tools gaining staying power emphasize robust infrastructure for context management, memory persistence, error handling, and resource optimization over demo-ready feature lists.
The Broader Lesson
My Claude MPM observation—64% context usage when Claude Code reports 10% until auto-compact—suggests something important: current AI tool optimization isn’t about maximizing utilization. It’s about finding the sweet spot where resource constraints actually improve output quality.
This has implications beyond Claude Code:
For Tool Developers: Consider whether your optimization target is the right one. Maximum throughput isn’t always optimal if it degrades quality.
For Platform Providers: Infrastructure improvements that seem invisible to users (better context management, smarter resource allocation) often deliver more value than flashy feature additions.
For Developers: Learn to work with constraints rather than fighting them. The tools that enforce reasonable limits may actually be helping you.
For AI Research: The path to better AI assistance may involve more strategic limitations, not fewer.
Conclusion
I can’t definitively prove this hypothesis—that Claude Code’s performance improvements stem primarily from more conservative context management. The evidence is circumstantial: my Claude MPM observations showing the 64% vs 10% discrepancy, the completion buffer that now gives tasks room to finish before compacting, community reports of changed auto-compact behavior, Anthropic’s new context management features, and the well-documented relationship between free context and reasoning quality.
But the pattern is compelling enough to warrant attention. If the hypothesis holds, it represents a profound lesson about AI tool development: sometimes the best way to improve performance is constraining the system in smarter ways rather than pushing utilization limits.
The old approach: run until you can’t run anymore, then try to recover—often unsuccessfully. The new approach: stop early enough to maintain consistent quality throughout, with enough buffer space to complete tasks gracefully before resetting.
These changes—when auto-compact triggers, how much buffer space it preserves—may explain why Claude Code feels noticeably better lately. Not just faster or smarter, but more reliable and consistent. The sessions that used to deteriorate halfway through now maintain quality to completion. The forced compactions that interrupted complex refactorings now happen at logical breakpoints.
It’s worth noting: improving AI tools sometimes means accepting what looks like inefficiency. That “unused” 25-35% of context isn’t wasted—it’s working memory that enables everything else to function properly. Infrastructure thinking applied to user experience, where the best system management becomes invisible because it prevents problems rather than recovering from them.
I’m Bob Matsuoka, writing about agentic coding and AI-powered development at HyperDev. For more on context management in AI development, read my analysis of Carrying Context or explore Claude-MPM’s approach to multi-agent orchestration.