The Agent Unlock: Why Opus 4.5 Changed How I Work
I switched. That’s the short version.
For months, Sonnet 4.5 was my default. Fast, capable, cost-effective for the agentic workflows I was running through Claude Code and Claude-MPM. Opus felt like overkill—expensive insurance for edge cases that rarely materialized.
Then Anthropic dropped Opus 4.5 on November 24th. Within a week, I’d flipped completely. Now I reach for Opus 4.5 whenever it’s available. Sonnet gets the quick stuff. Opus gets everything that matters.
Here’s the thing: I’m not alone. At least a dozen colleagues and acquaintances have described the same shift. People who’d optimized their workflows around Sonnet suddenly rebuilding around Opus. Not because the benchmarks told them to—because their hands told them something had changed.
The Unlock Framing
Developer McKay Wrigley captured something real when he wrote: “GPT-4 was the unlock for chat, Sonnet 3.5 was the unlock for code, and now Opus 4.5 is the unlock for agents.”
That framing resonates because it matches what I’m experiencing. GPT-4 made conversational AI useful. Sonnet 3.5 (and later 4.5) made AI coding assistance genuinely productive. Opus 4.5 makes autonomous agents viable for serious work.
The difference shows up in session duration. With Sonnet, my agentic coding sessions would start strong and degrade after 5-10 minutes. Context drift. Forgotten constraints. The model losing the thread on multi-file refactors. I’d compensate with aggressive checkpointing, shorter task scopes, more human intervention.
With Opus 4.5? Twenty minutes of coherent, unsupervised work. Sometimes thirty. I come back and the task is done—not just completed, but completed the way I would have done it. Idiomatically. Without the weird patterns that screamed “AI wrote this.”
Adam Wolff at Anthropic described it perfectly: “When I come back, the task is often done—simply and idiomatically.” That’s exactly right.
The Numbers Back It Up
Opus 4.5 hit 80.9% on SWE-bench Verified. First model to break 80%. GPT-5.1-Codex-Max sits at 77.9%. Gemini 3 Pro at 76.2%.
But raw benchmark scores don’t explain the experience shift. Token efficiency does.
Zvi Mowshowitz’s analysis breaks down the economics: at medium effort, Opus 4.5 matches Sonnet 4.5’s best performance while burning 76% fewer tokens. At high effort, it beats Sonnet by 4.3 points using 48% fewer tokens. Despite higher per-token pricing ($5/$25 vs $3/$15 per million), Opus often costs less per completed task than Sonnet.
One analysis showed complex tasks costing $1.30 with Opus versus $1.83 with Sonnet. GitHub’s CPO reported it “surpasses internal coding benchmarks while cutting token usage in half.”
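The arithmetic is worth sanity-checking. Here is a rough sketch: the per-million-token prices are the published rates cited above, but the token counts are hypothetical, chosen only to show how a roughly 48% reduction in usage can offset a higher per-token rate.

```python
# Rough per-task cost comparison. Prices are the published per-million-token
# rates cited above; the token counts are hypothetical illustrations only.
PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "opus-4.5": (5.00, 25.00),
    "sonnet-4.5": (3.00, 15.00),
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single task for the given token usage."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Suppose Sonnet burns 100K output tokens on a complex task and Opus finishes
# the same task in roughly half that (the ~48% reduction mentioned above).
print(task_cost("sonnet-4.5", 60_000, 100_000))  # ~$1.68
print(task_cost("opus-4.5", 60_000, 52_000))     # ~$1.60
```

The exact numbers depend entirely on the workload. The point is that per-token price and per-task cost are different quantities, and the second one is what shows up on your invoice.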
The agentic-specific benchmarks tell the real story. On MCP Atlas (scaled tool use), Opus 4.5 scores 62.3% versus Sonnet 4.5’s 43.8%. That 18.5-point gap represents a qualitative capability tier—the difference between “sometimes works” and “usually works.”
How My Workflow Changed
I’ve restructured around a simple principle: Opus for thinking, Sonnet for doing.
Complex architectural decisions? Opus. Multi-file refactors touching business logic? Opus. Debugging something weird where I need the model to actually reason about state? Opus.
Quick edits. Boilerplate generation. Well-defined single-file changes. Sonnet handles these fine.
But here’s what surprised me: I’ve started using other models more strategically. Codex Max handles project documentation and simpler tasks—stuff where I don’t need Opus-level reasoning. Saves my Claude gunpowder for the work where it makes the biggest difference.
I’m also building toward something bigger: integrating other LLM agents directly into Claude-MPM. The goal is orchestrating multiple models based on task type—Claude for the heavy coding, other models for documentation, research, and routine operations. Different tools for different jobs within the same workflow.
But the coding tool changed decisively. Claude owns that now.
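To make the routing idea concrete, here's a minimal sketch of what task-type dispatch looks like. The task categories and model identifiers are illustrative placeholders, not Claude-MPM's actual configuration.

```python
# Minimal task-type routing sketch. The categories and model IDs below are
# illustrative placeholders, not Claude-MPM's real configuration.
ROUTES = {
    "architecture": "claude-opus-4-5",  # design decisions, deep reasoning
    "refactor": "claude-opus-4-5",      # multi-file changes touching business logic
    "quick-edit": "claude-sonnet-4-5",  # boilerplate, single-file tweaks
    "docs": "codex-max",                # documentation and routine writing
}

DEFAULT_MODEL = "claude-sonnet-4-5"

def pick_model(task_type: str) -> str:
    """Route a task to a model by type, falling back to the cheaper default."""
    return ROUTES.get(task_type, DEFAULT_MODEL)

assert pick_model("refactor") == "claude-opus-4-5"
assert pick_model("docs") == "codex-max"
```

The real work lives in the dispatcher around a table like this, but the principle is that simple: route by task type and save the expensive model for the work that needs it.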
One more thing I’ve noticed: I’m hitting token limits less frequently. Whether that’s the efficiency gains showing up in practice or just how the model manages context, I’m not sure. But the sessions feel longer before I need to reset.
The Challenge for OpenAI and Gemini
Here’s what makes this interesting from a competitive standpoint: Anthropic isn’t establishing a lead. They’re widening one.
Claude has dominated coding benchmarks since the 4 series dropped. Sonnet 4, then Sonnet 4.5, consistently outperformed GPT and Gemini on real-world coding tasks. The gap was already there. Opus 4.5 turned a lead into a chasm.
The combination seems to be reasoning depth plus tool use sophistication. Raw intelligence helps, but the way Opus 4.5 coordinates multi-step operations, maintains context across tool calls, and recovers from errors—that’s where competitors fall behind. OpenAI and Google need to answer both dimensions.
OpenAI’s response will be telling. Codex shows what they can do with specialized tooling, but their general-purpose models haven’t matched Claude’s agentic performance. The Sora Android case study (4 engineers, 28 days, 85% AI-written code) was impressive—and also revealed how much optimization and internal access it required. External developers face different economics.
Google’s position is more nuanced. Gemini 3 Pro genuinely excels at multimodal and reasoning tasks. But for pure coding workflows? The community consensus has shifted toward Claude. Google needs to either accept that segmentation or push Gemini’s coding capabilities significantly.
The pricing move matters too. Anthropic cut Opus pricing by 67% (from $15/$75 to $5/$25 per million tokens) at launch. That signals confidence. They’re not positioning Opus 4.5 as a premium boutique offering—they’re going for volume.
What Actually Improved
Talking to other developers who’ve made the switch, a few specific capabilities come up repeatedly:
Thinking block preservation across context turns. Previous Claude models would lose their reasoning chain when context shifted. Opus 4.5 maintains coherent thought across longer sessions.
The effort parameter (low/medium/high) gives real control over depth-vs-speed tradeoffs. Set it high for complex problems, low for quick iterations.
Memory tools that store information outside the context window. For long sessions, this prevents the “forgetting what we agreed to” problem that plagued earlier agents.
Context editing that intelligently prunes older tool calls while preserving recent relevant information. The model manages its own context better.
None of these are revolutionary in isolation. Together, they add up to agents that don’t lose the plot.
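To make the context-editing idea concrete, here's a conceptual sketch of the pruning behavior, written as client-side logic. It illustrates the principle only; it isn't Anthropic's actual context-management API, and the message format is simplified.

```python
# Conceptual sketch of context pruning: keep the most recent tool outputs
# verbatim and collapse older ones to stubs. This illustrates the principle;
# it is not Anthropic's actual context-management implementation.
def prune_old_tool_output(messages: list[dict], keep_last: int = 3) -> list[dict]:
    """Collapse all but the last `keep_last` tool outputs into placeholders."""
    tool_idx = [i for i, m in enumerate(messages) if m.get("type") == "tool_output"]
    stale = set(tool_idx[:-keep_last]) if keep_last > 0 else set(tool_idx)
    pruned = []
    for i, msg in enumerate(messages):
        if i in stale:
            # Keep a record that the call happened; free the tokens its output used.
            pruned.append({"type": "tool_output", "content": "[older output pruned]"})
        else:
            pruned.append(msg)
    return pruned
```

The model-side version is smarter about what it keeps, but the effect is the same: the conversation stops accumulating dead weight.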
The Skepticism Is Fair
Not everyone’s convinced. And the criticisms have merit.
Usage limits frustrate power users. Opus 4.5 requires Max tier ($100-200/month), and even then, heavy users hit limits. Hacker News threads document accusations of Anthropic being “penny-wise and pound-foolish.”
Hallucination rates remain concerning: approximately 58% on Artificial Analysis Omniscience testing, better than Gemini 3 Pro but worse than Sonnet 4.5. For production code, you still need review.
The 200K context window trails GPT-5’s 400K. For massive codebases, that gap matters on paper. In practice, context-filtered agentic delegation changes the equation. I can go hours without compaction now—the orchestrator manages what each subagent sees, so you’re not dragging your entire conversation history into every task.
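That delegation pattern is worth spelling out. Here's a hedged sketch of the idea: the field names and structure are made up for illustration, not Claude-MPM's actual schema, and the real orchestrator does considerably more.

```python
# Sketch of context-filtered delegation: a subagent receives the task spec
# and only the files that task touches, not the orchestrator's full history.
# Field names here are illustrative, not Claude-MPM's actual schema.
def build_subagent_context(task: dict, workspace: dict[str, str]) -> dict:
    """Assemble a minimal context payload for a delegated subtask."""
    relevant_files = {
        path: contents
        for path, contents in workspace.items()
        if path in task.get("touches", [])
    }
    return {
        "instructions": task["spec"],             # what to do and how to verify it
        "constraints": task.get("constraints", []),
        "files": relevant_files,                  # only the slice this task needs
    }

task = {
    "spec": "Rename the billing helper and update its call sites.",
    "touches": ["billing/helpers.py", "billing/invoice.py"],
}
workspace = {
    "billing/helpers.py": "...",
    "billing/invoice.py": "...",
    "docs/changelog.md": "...",  # irrelevant to this task, never sent
}
payload = build_subagent_context(task, workspace)
assert "docs/changelog.md" not in payload["files"]
```

Each subagent starts from a small, relevant slice instead of the whole session, which is why compaction stops being a constant chore.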
Some developers see minimal difference from Sonnet. Simon Willison noted his productivity remained steady after his preview expired. Not everyone experiences the same shift.
And the “nerf cycle” theory—that Anthropic degrades models post-launch—persists in community discussions. The evidence doesn’t support it, but the suspicion affects trust.
Enterprise Adoption Lags (As Usual)
While model capability crossed a threshold, enterprise deployment remains cautious. Deloitte’s 2025 survey found only 11% of organizations actively using agentic AI in production, with another 42% still developing roadmap strategies.
That’s normal for infrastructure shifts. Individual developers adopt fast. Teams take longer. Enterprises take years.
The signals point in one direction though. Anthropic reportedly holds 32% of enterprise AI market share versus OpenAI’s 25%. Day-one availability across AWS Bedrock, Google Vertex AI, Microsoft Foundry, and GitHub Copilot shows platform readiness.
Agentic coding will become standard. The only question is timeline.
Bottom Line
Six months ago, letting AI agents handle substantial coding work felt risky. Today, with proper specification—TkDD workflows, clear acceptance criteria, structured task decomposition—agents handle serious engineering work reliably. Every.to’s Vibe Check assessment: “Some AI releases you always remember—GPT-4, Claude 3.5 Sonnet—and you know immediately something major has shifted. Opus 4.5 feels like that.”
Opus 4.5 didn’t create that shift alone. But it accelerated it decisively. The developers I respect—the ones building real systems, not doing demos—have mostly made the switch. Or they’re planning to.
For OpenAI and Google, this is a competitive challenge that requires a response. Not because Opus 4.5 is perfect—it isn’t—but because Anthropic just established a new baseline for what agentic coding should feel like.
My workflow changed in a week. From Sonnet-first to Opus-first. From skeptical about Opus to building around it.
That doesn’t happen often. When it does, pay attention.
I’m Bob Matsuoka, writing about agentic coding and AI-powered development at HyperDev. For more on multi-agent orchestration, see my deep dive into Claude-MPM or my analysis of the tools shaping 2025 development workflows.