Hyperdev: From The Trenches

Why Rust is the AI Language of the Future

Robert Matsuoka — Wed, 13 May 2026 12:03:01 GMT

A developer’s journey from Python to Rust reveals why the compiler, not the runtime, will define the next generation of AI systems.

Six months ago, I abandoned my Rust/Tauri-based writing app. Not because Tauri was bad—it wasn’t. Obsidian simply worked better for my needs. But that experience with Rust left an impression: the compiler was unlike anything I’d encountered. It didn’t just catch errors; it made entire classes of bugs impossible.

Then I built AI Commander for managing tmux sessions. Python would have been the obvious choice—async programming, process management, plenty of libraries. But I chose Rust, partly out of curiosity, partly because I needed rock-solid thread control. The result was revelation: code that compiled cleanly worked correctly, and performance was extraordinary.

Now I’m deep into replacing both mcp-vector-search and kuzu-memory with pure Rust implementations. My new trusty-search outperforms the Python predecessor by an order of magnitude, even though that version used compiled Python extensions. This isn’t an isolated case—it’s part of a fundamental shift happening in AI development.

Rust isn’t just another systems language. It’s becoming the infrastructure language of AI, and the reason has everything to do with letting the compiler handle the heavy lifting—especially as AI systems increasingly write their own code.

The Performance Reality Check

The performance gains are measurable and significant. Hugging Face’s tokenizers library, rewritten in Rust, delivers 10-100x speedups over pure Python implementations. Polars, the Rust-based DataFrame library, consistently outperforms pandas in data processing benchmarks. In embedded systems, Rust achieves 98% of C performance while eliminating memory safety issues entirely.

This isn’t just raw speed. Memory efficiency matters significantly in production AI systems. Where Python applications might consume gigabytes for large-scale data processing, equivalent Rust implementations often require 3-5x less memory. The difference compounds when deploying AI systems at scale.

This performance advantage stems from a fundamental philosophical difference in how Rust approaches the safety-speed trade-off.

When your AI system processes millions of sensor events per second in an autonomous vehicle, or makes split-second trading decisions with millions at stake, runtime failures aren’t just inconvenient—they’re catastrophic. Python’s flexibility, its greatest strength for research and experimentation, becomes a liability in these production environments.

The Compiler Does the Heavy Lifting

Here’s where Rust’s philosophy is revolutionary. Most languages force you to choose between safety and performance, between readable code and efficient code. Rust rejects this trade-off entirely through what it calls “zero-cost abstractions.”

The principle, borrowed from C++ but perfected in Rust, is elegantly simple: “What you don’t use, you don’t pay for. What you do use, you couldn’t hand code any better.”

Consider async programming, critical for AI systems handling multiple data streams. In Python, async operations carry runtime overhead—coroutine scheduling, context switching, memory allocation for tasks. In Rust, the compiler transforms your high-level async/await code into efficient state machines. No runtime scheduler, no hidden allocations, just bare-metal performance wrapped in readable syntax.

The genius is that the compiler absorbs the complexity. You write code that looks high-level and readable:

async fn process_ai_requests(stream: &mut DataStream) -> Result, Error> {
    let mut responses = Vec::new();

    while let Some(request) = stream.next().await {
        let response = ai_model.infer(request).await?;
        responses.push(response);
    }

    Ok(responses)
}

But the compiler generates assembly that rivals hand-optimized C. The ownership system prevents data races without locks. The borrow checker eliminates memory leaks without garbage collection. The type system catches logic errors that would be runtime failures in dynamic languages.

This is the heavy lifting: converting human-friendly abstractions into machine-efficient reality, all at compile time, with zero runtime cost.

This compiler-centric philosophy becomes even more critical as we enter an era where AI systems increasingly write their own code.

The AI Code Generation Protection Paradox

As AI-powered coding tools like Copilot, Claude Code, and Cursor become mainstream, a new challenge emerges: how do you ensure correctness when the code author might not fully understand what they’ve written?

When Claude or Copilot generates Python, Java, TypeScript, or C code, human oversight becomes the primary error-checking mechanism. Code reviewers need to catch type mismatches, memory leaks, race conditions, and logic errors that might not surface until production. As AI-generated code becomes more prevalent and complex, human review processes struggle to keep pace with the volume and complexity of generated code.

Think about it: an AI might generate a perfectly logical-looking concurrent data structure in Python that works flawlessly in single-threaded tests but creates subtle race conditions under load. A human reviewer might miss the threading implications entirely. The same AI generating Rust code? The compiler would reject it outright if the data sharing wasn’t provably safe.

Rust’s compiler fundamentally changes this dynamic. It doesn’t matter if the code was written by a human, an AI, or a collaboration between both—if it compiles, entire categories of dangerous bugs simply cannot exist. Memory safety violations? Caught at compile time. Data races in concurrent code? Impossible. Use-after-free errors? Eliminated by the ownership system.

This creates a fascinating inversion: Rust may be the best language for AI-generated code precisely because it doesn’t trust the programmer. While other languages rely on developer discipline and code review to catch errors, Rust enforces correctness through the type system and ownership model. The compiler becomes an automated, exhaustive code reviewer that never gets tired, never misses subtle bugs, and never waves through “probably fine” code.

As Microsoft research demonstrates, 70% of security vulnerabilities stem from memory safety issues that could be eliminated at compile time. When AI systems generate infrastructure code, this shifts from a productivity concern to an existential safety issue.

The compiler isn’t just doing heavy lifting for performance; it’s providing safety guarantees that scale beyond human oversight.

Memory Safety in the Age of AI

Rust’s creator, Graydon Hoare, designed the language after getting stuck in a broken elevator caused by software crashes. His goal was simple: write fast, small code without memory bugs. That mission has become critical as AI systems move from labs to production infrastructure.

In traditional software, memory safety issues mean patches and updates. In AI systems controlling physical infrastructure—autonomous vehicles, medical devices, financial trading systems—they can mean catastrophic failure.

Rust’s ownership model doesn’t just prevent crashes; it eliminates entire categories of bugs that plague concurrent AI workloads. While C++ applications in multi-threaded environments regularly exhibit race conditions during testing, Rust’s compile-time guarantees make such issues structurally impossible.

But here’s the crucial part: this safety comes at zero runtime cost. Unlike garbage-collected languages that trade performance for memory safety, Rust enforces safety through compile-time checks. Your production AI system runs with the performance of C but the safety guarantees that traditional systems languages can’t provide.

The Python-Rust Symbiosis

Here’s where the narrative gets interesting. The emerging trend isn’t Rust replacing Python—it’s Rust and Python working together. The PyO3 bridge, Rust-powered Python tools like Ruff and uv, and the growing number of “Python API, Rust engine” libraries show that the two languages are becoming symbiotic.

Python remains dominant for AI research and experimentation. The ecosystem is unmatched: PyTorch, TensorFlow, scikit-learn, Hugging Face Transformers. PyTorch’s 100,000+ GitHub stars reflect an ecosystem that won’t disappear overnight. When you’re prototyping models, analyzing data, or building internal tools, Python’s flexibility and ecosystem make it irreplaceable.

But production tells a different story. When performance, memory usage, and reliability matter—when you’re building the infrastructure that powers AI systems rather than the models themselves—Rust increasingly dominates.

The pattern emerging across tech giants is clear: Python for the interface layer, Rust for the infrastructure layer. Experimentation happens in Jupyter notebooks; production happens in compiled Rust binaries.

Looking Forward: Infrastructure vs. Experimentation

This division isn’t arbitrary—it reflects the maturation of AI from research curiosity to critical infrastructure. Research requires flexibility, rapid iteration, and access to cutting-edge libraries. Production requires reliability, performance, and safety guarantees.

Rust excels at infrastructure because the compiler handles the complexity of building robust systems. Memory safety, concurrency, error handling—all the concerns that make production systems hard to build correctly—are enforced by the type system rather than relying on developer discipline.

Consider the trajectory: Microsoft aims to eliminate C and C++ from their codebase by 2030, replacing it with Rust. The Rust Foundation officially positions the language as ideal for “ultra-reliable AI systems.” Major tech companies are increasingly choosing Rust for performance-critical AI infrastructure. These aren’t experimental projects—they’re billion-dollar bets on where AI infrastructure is heading.

The Compiler Advantage

What makes Rust uniquely suited for AI isn’t just performance or safety—it’s the fundamental approach of front-loading complexity into the compiler rather than dealing with it at runtime.

In AI systems, runtime failures are exponentially more expensive than compile-time failures. A training job that crashes after days of computation. An inference system that memory-leaks during peak traffic. A edge AI device that locks up in a safety-critical situation. These failures cascade through systems in ways that research environments rarely encounter.

Rust’s compiler is exhaustive in a way that benefits AI development specifically. It catches data races in concurrent model training. It prevents buffer overflows in tensor operations. It enforces proper error handling in distributed inference systems. It does this without performance overhead, without runtime monitoring, and without hoping that comprehensive testing caught every edge case.

The compiler becomes your co-pilot in building reliable AI infrastructure.

The Personal Proof

My own experience validates this thesis. AI Commander needed complex thread management for tmux session control—exactly the kind of concurrency that’s error-prone in traditional languages. Rust’s ownership system made data sharing across threads not just safe, but intuitive. The code that compiled was correct.

With trusty-search, the performance gains over mcp-vector-search weren’t just about Rust being “faster than Python.” They were about Rust enabling architectural choices—zero-copy string processing, efficient memory layouts, fine-grained concurrency control—that would be risky or impossible in garbage-collected languages.

These aren’t micro-optimizations. They’re fundamental design differences that become critical at scale.

The Real Limitations of Rust for AI

Before declaring Rust the future, it’s essential to acknowledge where it falls short compared to Python in AI contexts.

The learning curve is steep. Rust’s ownership model requires a fundamental mental shift that can slow initial development. Developers comfortable with garbage-collected languages often struggle with borrowing rules, especially when building complex data structures. The compiler’s strict requirements, while beneficial long-term, can feel obstructionist when prototyping.

The ecosystem gap remains significant. While Rust has impressive performance libraries, Python’s AI ecosystem is vast and mature. TensorFlow, PyTorch, scikit-learn, and thousands of specialized ML libraries have no Rust equivalents. Building production ML pipelines often requires library combinations that simply don’t exist in Rust.

Compilation time can kill iteration speed. Where Python allows instant feedback during development, Rust’s compilation process—especially for complex projects—can introduce friction that slows experimentation. This matters significantly in AI research where rapid iteration drives discovery.

Talent pool challenges are real. Finding experienced Rust developers is substantially harder than finding Python developers. The language’s complexity means onboarding takes longer, potentially impacting team velocity and project timelines.

Not every AI workload benefits from Rust’s strengths. Data analysis, model training with existing frameworks, and one-off scripts often don’t require the safety and performance guarantees Rust provides. Using Rust for these tasks can be overkill that reduces productivity without meaningful benefits.

These limitations explain why the Python-Rust symbiosis model makes sense: leverage each language where it excels rather than forcing one-size-fits-all solutions.

The Future is Compiled

As AI systems become infrastructure—as they move from experimental notebooks to production systems handling millions of users—the languages that power them will need to provide the guarantees that infrastructure requires.

Python will continue to dominate the experimental and interface layers. But the backbone, the high-performance inference systems, the real-time edge deployments, the safety-critical applications—these will increasingly be built in languages where the compiler, not the runtime, ensures correctness.

Rust represents a fundamental shift in how we approach building reliable systems: moving complexity from runtime to compile time, from testing to proving, from hoping to knowing.

This shift becomes critical as we enter the era of AI-generated code. When AI systems are writing the infrastructure that runs other AI systems, traditional human oversight breaks down. We need tools that can verify correctness automatically, at compile time, without relying on human reviewers to catch subtle but catastrophic errors.

The compiler does the heavy lifting so production systems can focus on their actual job: running AI workloads reliably, safely, and at scale. In a world where AI infrastructure is becoming as critical as power grids and financial networks—and where that infrastructure is increasingly written by AI itself—that guarantee isn’t just nice to have, it’s existential.

The age of agentic AI isn’t just about better models. It’s about building the reliable, high-performance infrastructure those models need to operate in the real world, written by AI systems that can’t be trusted to write safe code in traditional languages. And increasingly, that infrastructure is being built in Rust, one compile-time guarantee at a time.

Bob Matsuoka writes about the intersection of software engineering and AI at HyperDev. His latest Rust projects replace Python infrastructure with performance gains that would make even a compiler blush.

AI Memento

Robert Matsuoka — Mon, 11 May 2026 12:30:11 GMT

Leonard Shelby can’t form new memories. In Memento, after the attack, every fifteen minutes or so his short-term recall resets. He has a problem most engineers would recognize: a fast working scratch space and no durable storage. So he builds a system. Polaroids with handwritten captions on the back. A wall of pinned notes. Tattoos for the load-bearing facts — the ones he can’t afford to lose, can’t afford to misfile, can’t afford to look up wrong.

What he doesn’t do is try to remember everything. He doesn’t carry a thicker notebook. He builds infrastructure to find what matters, fast, when he needs it. The pocket holds what’s relevant right now. The system decides what makes it into the pocket.

I’ve been thinking about Leonard a lot this year, because the prevailing argument in my corner of the internet is that LLMs no longer need him. The pocket got bigger. Frontier models can hold a million tokens of context. Token prices fell. So just stuff the pocket. Skip the indexing. Skip the tattoos. Read everything.

I think that’s the wrong read of where we’re going.

The real argument, fairly stated

The “RAG is dead” position isn’t crazy. It goes roughly like this: a handful of frontier models — Gemini 1.5, Claude’s extended context tiers — can now handle contexts approaching 1M tokens without choking. Token prices have fallen to around $0.60 per million for several frontier models. Standing up a real retrieval pipeline — chunking strategy, embedding model, vector store, hybrid scoring, reranker, eval harness — is 40 to 80 engineering hours, easily. For a small, static corpus you query a few thousand times a month, just dumping the whole thing into context with prompt caching is simpler and arguably cheaper.

Claude Code is the existence proof people point at. It doesn’t ship a vector database. It uses ripgrep, file globs, and reads files into context. And it works pretty well. So why bother with the rest?

I’ll concede the envelope. For a 300K-token corpus, single tenant, low query volume, mostly static — yeah, stuff it. Cache it. Read it. Don’t build a search stack to feel sophisticated. That’s a real position and I don’t want to strawman it.

But that envelope isn’t where most of the work lives. And the argument the loud version makes — that grep replaces search, that long context replaces retrieval — confuses two different things; there are cracks in that argument.

The pocket problem

Here’s the first crack. “Lost in the middle.” Liu et al., 2023, TACL: when you stuff long contexts, model performance forms a U-shape. Strong recall at the beginning, strong recall at the end, degraded recall in the middle. The 2026-gen frontier models have improved on this — early evals suggest the curve is flatter — but it’s not gone.

So the first thing the bigger pocket buys you is the ability to put a lot of important stuff exactly where the model is most likely to underweight it. You can’t reorder a 700K-token dump by relevance unless you’ve already done retrieval. Which means the question doesn’t go away — it just hides.

Second crack: cost. RAG-style queries — retrieve relevant chunks, feed 8–16K of context — run around $0.005–$0.008 per request at $0.60/million input tokens, when you count the full prompt. A naive full-context pass against a 500K-token corpus runs $0.30 in input tokens alone, before output. That’s a 40–60x ratio at list price, and it widens if you have high cache-miss rates or large output windows. At 50K daily queries — not enormous, this is a mid-sized internal tool — you’re choosing between a few hundred dollars a day and tens of thousands. That’s not a rounding error you absorb. That’s a re-platforming decision that wakes someone up.

Third crack, and this is the one that actually makes me grumpy. Grep is not search. Grep is lexical pattern matching. Searching for auth_token finds the string auth_token. It finds it in commented-out dead code from 2023. It finds it in test fixtures with hardcoded sample values. It finds it in a vendor library you don’t even own. It misses the function called refresh_session_credential that does what you’re actually looking for, because the words are different. Vector search finds that function. Grep doesn’t. They’re not the same tool. Pretending they are is how you ship the wrong fix at 2 AM.

When people say “Claude Code uses grep, so search is over,” they’re describing a tool optimized for a bounded, file-system-structured corpus where the author controls the index shape. That’s a legitimate architecture for that problem. It does not generalize to “ten million product images” or “a 30-year codebase across 800 repos” or “tenant-scoped documents under per-row access control.” Long context can’t replace a knowledge graph, either. You can serialize a graph into text — flatten the edges, encode the relationships — but then you’ve lost typed edges and efficient query-time traversal. You can’t join across it. You can’t reason about edge types. You’d need to flatten the graph to feed it in, at which point you’ve discarded the structure that made it useful.

With me so far? The argument isn’t “long context is bad.” Long context is great. The argument is that long context is a consumer of retrieval, not a replacement for it.

What good retrieval actually looks like

A modern retrieval stack, the kind that holds up under real query volume, isn’t one thing. It’s a layered pipeline.

Lexical first — BM25, TF-IDF, the unfashionable stuff. It’s still excellent at exact tokens, identifiers, error messages, version numbers. The things vector embeddings smudge.

Dense second — vector search over embeddings. all-MiniLM-L6-v2 is the workhorse at 384 dimensions. Catches semantic neighbors, paraphrases, the function-named-different-but-doing-the-same-thing case.

Fuse them — Reciprocal Rank Fusion at k=60 is the standard, and it’s standard because it works. You don’t pick BM25 or vector. You merge their rank lists.

Rerank — a cross-encoder over the top-k from RRF. Slower per-pair, but you only run it on the candidates that made it through the cheap layers.

The numbers from a 2026 benchmark pass I was reading — “From BM25 to Corrective RAG,” April 2026 — make the layered case cleanly:

Dense vectors alone: Recall@5 of 0.587
BM25 alone: 0.644
Hybrid via RRF: 0.695
Hybrid + neural reranker: 0.816

That last number is 39% better than dense-only. It’s also 17% better than RRF without the reranker. The layers compound. In systems that need both recall and precision at scale, you usually want all four layers. And — this part matters — none of them are replaced by a bigger context window. You still have to pick what goes into the pocket.

There’s a piece on top of this, too: intent. A query like “where is process_payment defined” is not the same shape as “how does the refund flow handle partial captures.” The first is a definition lookup; BM25 should dominate, you want exact identifier matching, you don’t want semantic neighbors fuzzing the result. The second is conceptual; dense embeddings should dominate, you want the chunk that explains the flow even if the words don’t match. Treating those queries identically is how you get plausible-but-wrong answers.

The EMNLP 2024 “Self-Route” paper made a related point I keep coming back to: let the model itself decide whether a query needs retrieval or full-context, based on complexity. They got better accuracy AND lower compute cost. The two approaches aren’t enemies. They’re complements with different cost/quality envelopes, and a working system uses both.

(A note on scope: I’ve been using “search” and “memory retrieval” somewhat interchangeably here, which isn’t quite right. They’re the same pipeline mechanics — retrieve, rank, inject — but different problems. Memory retrieval has a temporal model search doesn’t: recency matters, decay matters, and the source is agent history rather than a document corpus. Memory is also primarily graph-based, because what you’re trying to recover is relationships and context across time, not just chunks of text. For the purposes of this argument — context window vs. retrieval discipline — the distinction doesn’t change anything. But it’s worth naming.)

What I’m building

I have two projects in this space. I’ll be specific about what each one is and isn’t, because the field is full of demos that don’t survive contact with real corpora.

mcp-vector-search is the older one. Python, per-project, designed to live next to a codebase as an MCP server. The retrieval core is hybrid: BM25 plus HNSW over MiniLM embeddings, with knowledge-graph expansion built from tree-sitter AST parsing — function definitions, call edges, import edges, type relationships. Fifteen-plus edge types in the graph. Cross-encoder reranking on top of RRF. MMR for diversity so you don’t get five near-duplicates in the top results. Temporal decay weighted by git blame age. Cyclomatic complexity factored into ranking, because dense, gnarly code is more often what you’re looking for than the trivial wrappers around it.

trusty-search is what I’m building toward. Rust daemon, machine-wide rather than per-project, multi-tenant across all my work. Same retrieval core — RRF at k=60, MiniLM embeddings, BM25 plus HNSW — but with intent classification on the front. Queries get tagged Definition, Usage, Conceptual, or BugDebt, and each intent has pre-tuned alpha/beta weights between lexical and dense. Sub-10ms p50 on warm queries.

I’ll be honest: trusty-search is early-stage. Some of the features I just listed are scaffolding with stubs underneath. The intent classifier is rule-based at the moment and needs to become a small model. The two tools are complementary rather than competing for now — mcp-vector-search is the deeper analysis layer, trusty-search is the fast facts store you hit constantly during a session — though the long-term plan is for trusty-search to replace mcp-vector-search.

The reason I’m building both is that I keep running into the same wall: I can give an agent a million-token window and it will still ask the wrong question of the wrong file. The bottleneck is not how much I can stuff in. The bottleneck is which 8K of the 800K matters for this query. That’s a search problem. It’s been a search problem for sixty years. It’s still a search problem.

Back to Leonard

The reason Leonard works as an analogy isn’t the amnesia. It’s the discipline. He decides what makes it onto a polaroid. He decides what gets tattooed and what stays loose. He writes “DON’T BELIEVE HIS LIES” on the back of a photo because he won’t have the context to evaluate trustworthiness later — so he encodes the conclusion now, into a system he’ll find when he needs it.

The context window is the pocket. Whatever Leonard pulls out and looks at right now. It can be enormous. It can be cached. It can be cheap. None of that decides what goes in.

Retrieval is the tattoo on his wrist. The polaroid pinned to the wall. The system that decides which fact survives, which one is one query away, which one is buried. The job of that system is exactly the job that doesn’t go away when the pocket gets bigger.

The question was never “do I need RAG?” That framing turned a design discipline into a vendor category and made it easy to dismiss. The question is the one Leonard asks every morning: what do I actually need to find, and have I built the thing that finds it? At a million tokens, you’re not freed from that question. You’ve just made it more expensive to answer incorrectly.

It feels crazy that I wrote this only a year ago…

Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.

Related reading:

Context Memory and Search: The Secrets to Effective Agentic Work — Why context management, memory, and retrieval are the three pillars of effective agentic systems
What’s In Your Second Brain? — Tooling for the modern CTO: how to build external memory that actually works
AI Power Ranking — Tool comparisons and benchmarks for AI practitioners
LinkedIn Newsletter — Strategic AI insights for CTOs and engineering leaders

Fear and Loathing in AWS

Robert Matsuoka — Thu, 07 May 2026 11:31:07 GMT

TL;DR: I spent years avoiding AWS due to its overwhelming complexity, preferring the simplicity of Vercel and composable stacks. Then Claude MPM changed how I work with AWS. Now I’m running sophisticated multi-service AWS deployments with GPU instances, comprehensive monitoring, and infrastructure-as-code—all managed through AI-assisted tooling. The lesson: agentic approaches transform ops workflows just as profoundly as they do coding.

Remember when AWS felt like digital quicksand? Every innocent “let me just deploy this simple app” spiraled into an afternoon lost in IAM policies, VPC configurations, and security group rules that made no sense. I’d start with what should be a five-minute deployment and emerge three hours later, bleary-eyed, with a working application and absolutely no confidence I could recreate the process.

That’s why I became a Vercel evangelist. One git push, automatic deployments, zero configuration. The developer experience was everything AWS wasn’t: predictable, fast, and actually enjoyable. For most projects, this composable stack approach—Vercel for frontend, managed databases, serverless functions where needed—delivered exactly the right balance of power and simplicity.

But somewhere along the way, that changed.

The Claude Code/MPM Turning Point

Today I’m running the kind of infrastructure I would have delegated to an ops team a year ago. GPU instances for ML workloads. Multi-AZ deployments across six subnets. Sophisticated monitoring pipelines that integrate CloudWatch metrics, Cost Explorer analysis, and automated GitHub issue creation. Terraform managing multi-account infrastructure with cross-service dependencies.

The difference? I don’t even need AWS’s Q assistant. I have something better: purpose-built AWS skills in Claude MPM that handle service deployment, infrastructure analysis, and operational workflows.

This transformation illustrates something crucial about the agentic revolution: it’s not just changing how we write code. It’s fundamentally altering how we approach operational complexity.

The Infrastructure Reality Check

Let me show you what I mean with real numbers. Here’s what I’m actually running across three active projects to support internal tools:

CloudWatch Reporting Service (Serverless):

12 Lambda functions handling health checks, metrics aggregation, and MCP server functionality
API Gateway HTTP API with sophisticated CORS and authentication
DynamoDB tables for state management and external directory lookups
SNS/SQS for alerting and dead letter queue handling
Direct Bedrock integration with Claude 4.5 Haiku for automated analysis
CloudWatch Events scheduling 5-minute monitoring cycles
Secrets Manager for GitHub app credentials and API keys

Code Intelligence Platform (Compute):

Two EC2 instances: t3.xlarge for web serving, g4dn.xlarge for GPU-accelerated indexing
EBS volumes with gp3 storage and custom IOPS configuration
VPC with six subnets across availability zones
Application Load Balancer with Route53 DNS and ACM certificates
EFS for shared file storage across instances
CloudWatch Synthetics for endpoint monitoring
Lambda-based Slack notifications triggered by SNS topics

Enterprise Infrastructure (Multi-Account):

Terragrunt-managed infrastructure across production and staging accounts
S3 backend for Terraform state with DynamoDB locking
Cross-account IAM policies and service integration
Integration with external providers (Sentry, GitHub, Kubernetes clusters)

A year ago, this list would have been my personal infrastructure horror story. Today, it’s Tuesday.

What Changed

The transformation wasn’t gradual. It was a step function that happened when I realized AI assistants could handle the cognitive overhead that makes AWS painful.

Before: AWS documentation as archaeological expedition. Digging through service guides, trying to understand the relationship between VPC route tables and security groups, wondering if I need an Internet Gateway or a NAT Gateway or both. Every deployment felt like solving a puzzle where half the pieces were hidden.

After: Natural language infrastructure requests. “Set up monitoring for this Lambda function with alerting to Slack” becomes a series of guided steps where the AI handles the AWS-specific implementation details while I focus on the business requirements.

The key insight is that AWS’s complexity isn’t inherently bad—it’s just cognitively expensive. When you remove that cognitive load through AI assistance, you can appreciate what all those services actually enable.

Take my monitoring setup. Previously, I would have settled for basic uptime checks because configuring comprehensive CloudWatch metrics, Cost Explorer integration, and automated issue creation felt like a weekend project. With claude-mpm AWS skills, it became an hour of guided configuration that resulted in production-grade observability.

Or consider the GPU instance management. The g4dn.xlarge for ML indexing runs sophisticated start/stop automation, monitors for runaway processes, and automatically scales EBS volumes based on data requirements. Setting this up manually would have required deep expertise in EC2 lifecycle management, CloudWatch alarms, and Lambda automation. With AI assistance, I focused on defining the business logic while the tooling handled the AWS implementation.

The DX Philosophy Still Matters

None of this means AWS wins every comparison. Vercel’s developer experience remains superior for the use cases it targets. When I need to ship a marketing site or a straightforward web application, git push deployment still beats any infrastructure-as-code workflow.

The difference is recognizing when complexity serves a purpose versus when it’s just complexity. Vercel abstracts away infrastructure concerns because most web applications don’t need granular control over compute, storage, and networking. But when you’re building systems that do need that control—ML pipelines, high-throughput data processing, complex service topologies—AWS’s granularity becomes valuable rather than burdensome.

AI assistance changes the cost-benefit calculation. When configuring VPC networking takes 20 minutes of guided conversation instead of three hours of documentation archaeology, you can choose AWS for projects where you previously would have compromised on requirements to avoid operational overhead.

But there’s an honest accounting problem buried in that logic. Claude Code isn’t free. API costs, subscription fees—if you’re running significant conversation volume to figure out your infrastructure, you’re spending real money. At some point, you’re spending more on AI assistance than a Vercel seat would cost. The “AWS saves money at scale” argument gets complicated fast when you factor in the cognitive tooling required to get there.

So let me be direct about where each wins. Pure self-service developer experience—one engineer, a web app, ship it fast? Vercel, and it’s not particularly close. The moment you need an AI co-pilot to configure your infrastructure, you’ve added a cost layer that Vercel eliminates by design. But complex multi-service deployments—ML pipelines, GPU compute alongside serverless, multi-account Terraform, monitoring infrastructure that spans six services—those don’t live in Vercel’s world. That’s where the math inverts and AWS earns its complexity premium.

The Broader Implications

This transformation reveals something important about how agentic approaches will reshape technology adoption. We’re not just making individual tasks more efficient—we’re changing which categories of tools become accessible to developers.

I see this pattern across the infrastructure stack. Database migrations and performance tuning become approachable when AI translates business requirements into specific configuration changes. Kubernetes stops being “too complex for small teams” when you can describe desired behavior in natural language and get helm charts and operators generated automatically. IAM policies, security groups, and compliance frameworks become manageable when AI can analyze your application requirements and generate least-privilege configurations.

These tools were always powerful. They were just too expensive to learn and maintain for many use cases. AI assistance changes that economics.

Where This Goes Next

We’re still early in AI-assisted infrastructure management. Today’s tooling handles deployment and configuration. Cost optimization, security posture, performance tuning—those are coming. Full system architecture from high-level requirements is probably further out than the hype suggests, but it’s not science fiction.

But the fundamental lesson remains: complexity isn’t always the enemy. Sometimes it’s just temporarily inaccessible. When AI removes the accessibility barriers, you can choose tools based on their actual capabilities rather than their learning curves.

For now, I’m running infrastructure that would have seemed impossible to manage solo a year ago. And it’s kind of fun.

AWS might still feel like quicksand sometimes. But now I have a helicopter.

Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.

Related reading:

AI Power Ranking — Tool comparisons and benchmarks for AI practitioners
LinkedIn Newsletter — Strategic AI insights for CTOs and engineering leaders

You weren't imagining things...Claude Code was dumber this month

Robert Matsuoka — Fri, 24 Apr 2026 13:53:25 GMT

So if you’ve been using Claude Code and noticed it felt... off... you weren’t imagining it.

Anthropic published a full breakdown yesterday and it’s actually three separate bugs that compounded into what looked like one big degradation. The developer community was right to be concerned, and the evidence they collected was instrumental in getting this fixed.

Here’s what actually happened:

1. They silently downgraded reasoning effort (March 4)

They switched Claude Code’s default from high to medium reasoning to reduce latency. Users noticed immediately. They reverted it on April 7.

Classic “we know better than users” move that backfired. From their postmortem:

“This was the wrong tradeoff. We reverted this change on April 7 after users told us they’d prefer to default to higher intelligence and opt into lower effort for simple tasks.”

The UI was appearing frozen in high reasoning mode, so they made an executive decision to sacrifice quality for speed. Developers immediately felt the difference and pushed back hard.

2. A caching bug made Claude forget its own reasoning (March 26)

This one was particularly insidious. They tried to optimize memory for idle sessions—clear old thinking after an hour of inactivity to speed up resumption. Sounds reasonable, right?

A bug caused it to wipe Claude’s reasoning history on EVERY turn for the rest of a session, not just once. So Claude kept executing tasks while literally forgetting why it made the decisions it did.

The cascading effects were brutal:

Every request became a cache miss
Usage limits drained faster than expected
Claude appeared “forgetful and repetitive”
Sessions felt like they were constantly resetting

3. A system prompt change capped responses at 25 words between tool calls (April 16)

They added this seemingly innocent instruction: “keep text between tool calls to 25 words. Keep final responses to 100 words.”

It caused a measurable 3% drop in coding quality across both Opus 4.6 and 4.7. They caught this through ablation testing—removing the instruction and measuring the performance difference.

Reverted April 20.

The community evidence was damning

While Anthropic was investigating internally, the developer community was building their own case. Stella Laurenzo from AMD’s AI group published the most comprehensive analysis—6,852 Claude Code sessions and over 234,000 tool calls.

Her findings:

Median visible thinking length collapsed 73% (2,200 → 600 characters)
API calls per task spiked up to 80x from February to March
Claude was choosing “simplest fix” over correct solutions

BridgeMind’s testing showed Opus 4.6 accuracy dropping from 83.3% to 68.3%.

The data was undeniable.

The perfect storm effect

Here’s what made this particularly hard to pin down: all three bugs affected different traffic slices on different schedules. The combined effect looked like random, inconsistent degradation.

Hard to reproduce internally. Hard for users to isolate the exact cause. It just felt... wrong.

Some sessions hit the reasoning downgrade. Others hit the caching bug. The unlucky ones hit multiple issues simultaneously. No wonder it seemed like Claude was having random bad days.

What this reveals about AI product development

This postmortem is actually refreshing in its transparency. Most AI companies would have quietly fixed the issues and moved on. Anthropic owned the mistakes publicly.

But it also highlights a fundamental tension in AI product development: users often prefer maximum capability over convenience optimizations. The reasoning effort downgrade was done for user experience (reduce perceived latency), but developers would rather wait for better output.

The lesson: don’t optimize away what users value most without asking them first.

All fixed now (v2.1.116)

As of April 20, all three issues are resolved:

Default reasoning is now “xhigh” for Opus 4.7, “high” for others
Caching bug squashed
Verbosity limits removed
Usage limits reset for all subscribers

Anthropic is also committing to more transparency going forward with a dedicated @ClaudeDevs account for deeper technical communication with developers.

The community was right to raise hell about this. And Anthropic’s response—full transparency with concrete fixes—sets a good precedent for how AI companies should handle quality regressions.

Your coding assistant is back to full strength.

Independent Validation

The technical analysis backing this story comes from multiple independent sources. Stella Laurenzo’s comprehensive audit of 6,852 sessions provided the quantitative foundation. BridgeMind’s testing offered controlled benchmark data. These weren’t isolated complaints—they were systematic investigations with reproducible findings.

When a company publishes a detailed postmortem acknowledging specific engineering decisions that degraded their product, and that postmortem aligns with community-gathered evidence, we’re seeing transparency in action. The developer community did the work to document the problems. Anthropic owned the solutions.

Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.

Related reading:

AI Power Ranking — Tool comparisons and benchmarks for AI practitioners
LinkedIn Newsletter — Strategic AI insights for CTOs and engineering leaders

Opus 4.6 vs 4.7: The Real Cost of Incremental AI Improvements

Robert Matsuoka — Wed, 22 Apr 2026 11:31:49 GMT

Opus 4.7 dropped last week. Lots of excitement. Then the other shoe dropped. I ran identical coding tasks against both Opus 4.6 and 4.7 to see if the capability improvements justify the cost increase. Both models passed all 10 tests. The quality difference is real — but Opus 4.7 consumed 3.6× more tokens and cost 3.6× more for the same outcome.

That’s not a typo. Same task, same success rate, nearly 4× the cost. I have the receipts.

Anthropic says “pricing unchanged” because the per-token rates stayed the same. What they don’t mention is that Opus 4.7 systematically burns through more tokens to complete identical work. The model writes, then revises. Opus 4.6 writes correctly the first time. Both approaches work. Only one bills you for the revision process.

TL;DR

Both models passed 10/10 tests: Quality improvements are measurable but incremental (better typing, more thorough code)
3.6× cost increase for identical outcomes: $0.38 vs $1.38 for a 30-minute coding task in controlled testing
Token consumption drives cost, not capability: Opus 4.7’s iterative working style consumes 2.9× more output tokens per task
Agentic mode required: One-shot testing shows 4.7 fails 9/10 tests without tool access, while 4.6 passes perfectly
Per-token rates unchanged, real bill moved anyway: 4.7 burns 4.8× more cache tokens per task — the rate card stays flat, your invoice doesn’t

The Controlled Test

Last week I ran a head-to-head benchmark using the same Level 1 coding task for both models: build a complete Python CLI tool (Markdown Table Formatter) from scratch with full test coverage.

Test setup:

Framework: claude_agent_sdk v0.1.64, full agentic mode
Models: claude-opus-4-6 vs claude-opus-4-7
Success criteria: Pass all 10 provided pytest tests
Execution: Concurrent runs with identical prompts

Both models succeeded. The difference was entirely in how they got there.

The Numbers

Metric Opus 4.6 Opus 4.7 Ratio Wall clock time 114.8s 259.1s 2.3× slower Agent turns 17 23 35% more Output tokens 6,384 18,289 2.9× more Cache read tokens 215,853 1,034,165 4.8× more Total cost $0.38 $1.38 3.6× more expensive

The Behavioral Fingerprint

The tool usage patterns reveal why 4.7 costs more:

Tool Opus 4.6 Opus 4.7 Write 6 7 Bash 6 9 Read 4 1 Edit 0 5

Opus 4.7 made 5 Edit calls to revise files after writing them. Opus 4.6 made zero — it wrote all 6 source files correctly in a single pass, ran pytest once, passed 10/10 tests.

The cache token burn (4.8× more) suggests 4.7 does extended internal reasoning between each tool call. It’s thinking harder, which shows up in better code quality — more type hints (35 vs 25 function definitions), more thorough coverage (820 vs 471 lines of code). But you pay for that thinking process.

Quality vs Cost Trade-off

The output quality difference is genuine. Opus 4.7’s code was more defensively written — better typed, more thorough on edge cases. When you’re in the middle of debugging a genuinely hairy distributed system problem or making an architectural call with real downstream implications, that extra care is worth something.

But for this Level 1 coding challenge, both approaches delivered identical functionality. The question becomes: is 40% better typing and 74% more comprehensive coverage worth 260% higher costs?

When the Premium Isn’t Optional

I tested both models in one-shot mode (no tools, single response) to see if you could avoid the iterative cost overhead.

Metric Opus 4.6 Opus 4.7 Output tokens 4,725 23,907 Cost $0.27 $0.94 Tests passed 10/10 1/10

Opus 4.7 failed catastrophically without tool access. It generated 5× more tokens but couldn’t follow output format instructions — most files were unparseable and 9 of 10 tests failed. Opus 4.6 passed perfectly on the first attempt.

This reveals a structural dependency: Opus 4.7’s quality advantages require the full agentic feedback loop. You can’t switch to a cheaper execution mode to control costs. The iterative self-correction that makes it better is also what makes it expensive — there’s no cheaper version of how this model works.

Scale It Out

Scale these numbers to something realistic:

100 equivalent coding tasks per day:

Opus 4.6: ~$38/day → ~$13,870/year
Opus 4.7: ~$138/day → ~$50,370/year
Annual cost increase: +$36,500

This matches what production teams are reporting. The Finout analysis documented overnight cost jumps from $500 to $675/day after deploying 4.7. My testing provides a mechanistic explanation: the model’s working style is token-intensive by design.

The cost increase compounds with Anthropic’s separate tokenizer changes that increase consumption up to 35% for identical prompts. You get hit twice: more tokens per task, plus each token costs more to count.

This Is Anthropic’s Move, Not the Industry’s

Other providers aren’t doing this. GPT-5.4 achieves comparable benchmark performance without the tokenizer change. Anthropic can pull this off because they’re ahead on benchmarks right now — that’s the advantage, and they’re using it.

Which means this is actually a model selection problem, not a budget problem.

I’m not upgrading my agentic workflows to 4.7 by default. For complex architectural work where the reasoning depth matters — distributed systems debugging, refactoring decisions with downstream implications — yes, 4.7 earns the premium. For routine code generation, test writing, documentation? 4.6 passes the same tests at a quarter of the cost, as I just demonstrated.

Sonnet is even more aggressive on cost for work that doesn’t need Opus-level reasoning at all. I’ve been pushing more of my day-to-day agentic tasks there.

GPT-5.4 is worth keeping in rotation too. Comparable coding benchmark performance, no tokenizer games, and the competitive pressure helps if you ever need to push back on Anthropic pricing.

The Reddit community caught the tokenizer changes within hours of release while Anthropic’s communications stayed focused on “unchanged pricing.” That’s the early warning system. Watch community cost reports when a new model drops, not the vendor announcement.

Anthropic will keep doing this as long as they’re leading. The way you stay ahead of it is knowing your actual token consumption per task — not the rate card, the real burn — and routing work to the cheapest model that gets the job done.

Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.

Related reading:

AI Power Ranking — Tool comparisons and benchmarks for AI practitioners
LinkedIn Newsletter — Strategic AI insights for CTOs and engineering leaders

Is This The Era of the Connector?

Robert Matsuoka — Thu, 26 Mar 2026 12:32:38 GMT

TL;DR

Users consolidate around 4 core platforms (Slack, Notion, Email+Office, AI Tool) while rejecting standalone/SASS tools
Connectors that bring data to users beat new tools that require visits
Infrastructure breakthrough: Slack manifests + MCP protocol + LLM services make org-specific connectors trivial
Democratization effect: Bootcamp engineers can now build sophisticated integrations that once required senior developers
Production evidence: 3 connectors (4-6 hours each, $1-2.5K in AI tokens) replaced 5-6 standalone tools that would cost $150-300K+ traditionally
Tolerance for “broad-based general tools” declining — UX mindshare captures traffic even when APIs do the work

In the past two weeks, I’ve built three connectors that collectively replaced what would have been five or six standalone tools.

Engineering Search Connector: Hosted semantic and knowledge graph search service built by repurposing mcp-vector-search. Unified search across 150+ GitHub repos, 1,700+ wiki pages, and ticket systems. Accessible through Slack bot, web interface, CLI, and MCP connector for Claude.AI that brings engineering knowledge to where people already work.

CRM Data Connector: Live customer data piped directly into Claude.AI sessions via MCP. No dashboard to check, no reports to generate. Ask “What’s our pipeline this quarter?” and get live data in 1-3 seconds.

Document Workflow Connector: Artifact browser and guided PR workflow for non-technical contributors. Product managers can explore and propose changes to structured docs without touching git or learning new interfaces.

None of these required users to adopt a new primary tool. Each brings specialized functionality to platforms they already inhabit daily. And each took roughly 4-6 hours to build (plus agent time).

This isn’t a productivity humble-brag. It’s evidence of a fundamental shift in how organizations interact with their data. We’re entering the connector era — building bridges between specialized intelligence and the handful of platforms where users actually live, rather than standalone applications they have to visit.

The numbers support this pattern. Users toggle between apps 1,200 times daily, losing 40% productivity to context switching. Connector ecosystems are exploding: Slack’s marketplace hosts 2,600+ apps with 550K+ daily custom integrations. The MCP protocol went from 100K to 8M downloads in six months — unprecedented adoption for plumbing infrastructure.

The question isn’t whether Slack, Notion, and Claude.AI will survive the AI wave. It’s whether the hundreds of specialized tools competing for attention understand that the game has changed. Users have less tolerance for broad-based general tools than they once did. The platforms that capture UX mindshare will get most of the traffic, even if APIs and agents do the actual work behind the scenes.

The evidence is clear from user behavior: they don’t want to learn a new search interface, remember another login, or context-switch to yet another tab. They want the intelligence layer to meet them where they already are.

The Source of Truth Problem

Most organizations have a source-of-truth problem they haven’t fully articulated. They have Slack for real-time communication. They have Notion or Confluence for documentation. They have Google Docs for drafts that become documents that become outdated that stay around anyway. They have JIRA for tickets that may or may not reflect what was actually decided. They call this a “knowledge management system.” It’s more accurately a distributed archive of partially-intentional artifacts with no clear authority hierarchy.

The question “who owns this decision?” leads to a Slack thread from eight months ago, a Notion page that three people edited and nobody is certain is current, and a Google Doc someone linked in a comment that requires permission to access. This is the status quo. It functions, after a fashion, because humans are good at triangulating across ambiguous sources and asking colleagues to fill gaps.

AI agents are not good at this. They will confidently synthesize the eight-month-old Slack thread with the outdated Notion page and present the result as a coherent answer. The errors won’t be obvious. They’ll be subtly wrong in ways that require domain expertise to catch.

The source of truth problem was always real. It was manageable when every query ran through a human brain. It becomes actively dangerous when queries run through an inference layer first.

What you actually need — what organizations are starting to build — is a repository where the data structure enforces truth. Not a place where the right answer might be findable if you look hard enough. A place where the structure of the data makes the wrong answer harder to produce.

But here’s the connector insight: that structured repository doesn’t need to be where users spend their time. It can be the authoritative backend that feeds connectors in the platforms users already inhabit.

Where Users Actually Live

User attention has consolidated around four core platforms:

Slack: Real-time coordination, team presence, ephemeral decisions. 32.3 million daily active users with 550K+ custom integrations daily.

Email + Office Suite: Formal communication, document collaboration, external stakeholder interface. Microsoft reports 400M+ Office 365 commercial users.

Notion: Knowledge management, project tracking, collaborative documentation. 100M+ users consolidating entire productivity stacks.

Claude.AI: AI assistance, analysis, content generation. Rapidly becoming the default interface for LLM interactions across knowledge work.

Each platform serves a legitimate core function. Tool builders make the mistake of assuming they can compete for primary platform status by building something better. Users are done adopting new primary platforms. They’re consolidating around tools that already have their attention.

The pattern reveals a deeper truth: people live in transactional systems, not knowledge systems. Slack is where decisions happen. Email is where approvals flow. Claude.AI is where analysis gets done. These are transactional - work happens there daily.

Confluence is a perfectly good wiki tool. But it’s knowledge-at-rest, not transactional. People don’t live there. They visit when forced to document something, then return to their transactional workflows. The knowledge gets stale because maintenance happens in a different system than usage. (Notion manages to straddle the line between knowledge at rest and transactional)

Integration platforms like Zapier understand this - they connect 8,000+ apps with 3.4M+ business users by bringing specialized functionality to existing workflows rather than creating new destinations.

Users just want the data, dammit. They don’t want to learn your interface.

The Connector Infrastructure Moment

What changed? Three pieces of infrastructure matured simultaneously:

Slack Manifest Tool makes organization-specific bots trivial to build. The manifest.yaml format standardizes permissions, scopes, and deployment. Weeks of OAuth wrestling became hours of configuration.

MCP Protocol achieved “USB-C for AI” universal connectivity. Claude.AI, ChatGPT, and dozens of platforms support the same connector format. Build once, deploy everywhere. The 100K to 8M download growth in six months reflects pent-up demand.

LLM Services like Bedrock and OpenRouter provide natural language interfaces that make connectors intelligent rather than just data pipes. Ask questions in plain English, get structured responses, maintain conversation context.

Semantic Search Infrastructure like mcp-vector-search can be repurposed as hosted services, adding intelligence layers that understand meaning rather than just matching keywords. This transforms basic data access into contextual knowledge retrieval — a crucial enabler for connectors that need to surface relevant information rather than exact matches.

Combined, you can build a production connector in a single afternoon. Slack manifest defines the bot interface. MCP schema defines the data sources. Semantic search handles intelligent retrieval. Bedrock provides the language understanding. Deploy to AWS Lambda and you’re live.

My three connectors follow this exact pattern. The engineering search connector repurposes mcp-vector-search as a hosted service with all-MiniLM-L6-v2 embeddings for semantic and knowledge graph search, but the user interface is just Slack commands and Claude.AI MCP tools. The CRM data connector is a headless AWS service that makes customer data available through natural language queries in Claude.AI. The document workflow connector provides git workflows through a web UI that non-technical users can navigate.

Each connector took 4-6 hours to build. Each would have taken 4-6 months to build as a standalone application with user management, authentication, interface design, mobile responsiveness, and all the infrastructure a “real app” requires.

The Democratization Effect: The infrastructure shift goes beyond development speed — it’s democratizing who can build sophisticated integrations. What once required senior engineers with deep API knowledge can now be handled by bootcamp graduates following established patterns. I built these first three connectors to validate the approach, but similar projects will go to junior engineers going forward.

This changes resource allocation fundamentally. Organizations can solve integration problems without burning senior engineering cycles on “plumbing” work. Information that was once very hard to obtain is now trivial to access.

The economics are compelling. Building three production connectors cost roughly $1,000-2,500 in AI tokens over 44 days. Traditional contractor development for equivalent functionality would have run $150-300K+. The connector approach isn’t just faster — it’s 100x more cost-effective.

The adoption metrics prove the value. The CRM connector launched March 18th with 23 invocations on day one. No formal rollout, no training sessions, no onboarding docs. Just organic discovery across a 300+ person company. By week two, daily usage tripled to 95 invocations per day. Tuesday hit 152 invocations — including a 40-query analysis session in a single hour. That’s 299 queries in 7 days with zero errors, from a connector that took 4-6 hours to build.

The era isn’t about choosing between platforms. It’s about connecting specialized intelligence to the platforms users have already chosen.

Why Wikis Can’t Compete in the Connector World

Traditional knowledge management tools face a structural mismatch in connector architecture. Wikis assume users will “go to the tool” for information. Connectors flip that assumption: the tool comes to the user.

This creates specific problems:

The Authoring/Retrieval Tension: Wikis optimize for collaborative authoring — anybody can edit, flexible structure, link everything, evolve over time. This is the opposite of what retrieval needs: consistent schema, clear ownership, explicit governance. When you pipe wiki content through a connector, you inherit all the inconsistencies that collaborative authoring creates.

Search Architecture Limitations: Confluence’s search is notoriously bad because it does keyword matching on unstructured text. This was problematic before LLMs. With LLM-powered connectors, it becomes worse because the AI layer adds confidence to bad retrieval results. Users get wrong answers delivered with conviction.

Static Data Problem: Notion’s AI operates on static content snapshots, disconnected from real-time operational state. When CRM connectors query “What’s our pipeline this quarter?” through a Notion connector, it’s answering based on what someone wrote about the pipeline, not live customer data. The connector amplifies the staleness problem.

Governance at Scale: Wiki governance defaults to “community-maintained,” which means in practice nobody is responsible for accuracy. As organizations scale, wikis accumulate pages nobody knows are outdated. Connectors don’t solve this — they accelerate the distribution of stale information.

Our structured document framework represents the alternative: git-backed Markdown with schema-validated YAML frontmatter. Every document has explicit metadata: owner, status, domain, confidence, time_box. The structure is the feature. When document workflow connectors expose this, the schema ensures consistent data quality regardless of interface.

Structured document repositories outperform wikis for AI query by 35-60% in controlled tests. Clean Markdown with explicit metadata reduces token usage by 20-30% and improves retrieval accuracy significantly. This isn’t philosophical — it’s measurable.

Wikis remain useful for collaborative drafting and evolving reference material. But they’re not the right backend for connector architecture. The connector era requires structured data sources that can maintain quality across multiple interface layers.

Building Connectors vs. Standalone Tools

The strategic choice organizations face isn’t “which tool should we build?” but “should we build a tool or a connector?” My experience with three connectors illuminates the trade-offs:

Engineering search could have been a standalone search platform. Instead, it’s accessible through Slack commands, CLI tools, web interface for visualizations, and MCP tools for Claude.AI sessions. Same search capability, four different interaction models depending on user context.

CRM integration could have been a dashboard with charts and filters. Instead, it’s a headless MCP service that makes customer data available through natural language in Claude.AI. Ask “Show me deals over $100K in our target vertical” and get live results in 1-3 seconds. No dashboard to learn, no visual interface to maintain.

Document workflows could have been a product management SaaS platform. Instead, it’s a guided workflow that helps non-technical contributors interact with existing git-backed document frameworks. Browse artifacts, generate AI summaries, submit PRs — all through interfaces that match users’ technical comfort levels.

Specialized intelligence delivered through platforms users already inhabit. The connector approach wins on several fronts.

Development time: 4-6 hours vs. months. No user management, authentication, responsive design, or mobile apps to build.

Adoption friction: Zero onboarding. No new logins, training sessions, or change management overhead.

Maintenance burden: Focus on data logic and intelligence, not interface maintenance across device types and browser versions.

Integration: Connectors compose naturally with existing workflows. Slack discussions can include live Salesforce data. Claude.AI analysis can pull from engineering knowledge graphs. Standalone tools require export/import workflows.

The business case is compelling: connector development costs 10-20% of standalone application development while achieving 3-4x higher user engagement.

The implications go beyond development efficiency. Users have less tolerance for “broad-based general tools” than they once did. Managing dozens of application contexts creates unsustainable cognitive load. Platforms that capture daily attention get most of the traffic, even when APIs and agents do the computational work behind the scenes.

This creates different winner-take-all dynamics. The winners aren’t necessarily the best tools. They’re the platforms users choose to inhabit, plus the connectors that bring specialized capability to those platforms.

What This Means for Your Stack

The connector era doesn’t eliminate existing tools — it clarifies their appropriate roles and challenges their assumptions about user attention.

Slack keeps its coordination function: Real-time presence, threading, ephemeral decisions. But it becomes a command interface for structured data sources rather than a knowledge repository itself.

Notion retains collaborative authoring value: Drafting, evolving documentation, reference material. But it stops being the “source of truth” for operational decisions. That role shifts to structured backends accessible through Notion connectors.

Specialized tools survive by becoming intelligent backends: Your CRM, your monitoring system, your code repositories — these maintain their core data authority. But user interaction shifts to connector layers in platforms where users already work.

The question to ask about any tool: Is this where I want an AI agent pointing when it needs authoritative information? If the answer is no, it’s not your source of truth. It might still be valuable — as a backend, as a collaborative space, as a specialized interface for expert users. But it doesn’t earn the designation of “primary platform.”

The organizational challenge: Getting non-technical teams comfortable with structured data workflows is real change management. Document workflow connectors address this by providing guided interfaces for git-backed workflows. But someone still needs to own schema design and governance processes.

Who should build connectors first: Engineering-adjacent teams with strong PM-engineering collaboration. Organizations where AI hallucination on operational decisions creates measurable cost. Companies that have already felt the pain of distributed knowledge management.

Timing matters: Most organizations haven’t built connector strategies yet. Companies that establish structured knowledge backends with connector frontends in 2026 will have 12-18 months of advantage when AI-mediated query becomes standard practice.

The connector era isn’t about choosing between platforms. It’s about connecting intelligent backends to platforms users have already chosen. Organizations that get this right will operate with less context switching and faster access to operational data.

Users just want the data, dammit. The question is: will you bring it to them, or keep expecting them to come to you?

Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.

Related reading:

AI Power Ranking — Tool comparisons and benchmarks for AI practitioners
LinkedIn Newsletter — Strategic AI insights for CTOs and engineering leaders

MCP Was a Brilliant Idea — But It Needs a Proper API Behind It

Robert Matsuoka — Tue, 10 Mar 2026 11:30:57 GMT

The pattern shows up constantly when I look at MCP server implementations. Someone discovers the protocol, gets excited about giving agents tool access, builds a server in a weekend, ships it to the registry. Six tools. Maybe eight. Each one is basically a direct passthrough to whatever SDK the underlying service provides.

And for about a week, it feels like it works.

Then the agent needs to do something real. Archive 400 emails from a specific sender. Pull all calendar events for Q1, cross-reference them with a project timeline, and generate a summary. Move a batch of files across Drive folders. The agent starts calling tools in sequence, hits rate limits, gets confused about pagination, makes the same API call twelve times trying to work around a 50-item response limit that the MCP tool never exposed as a parameter. Eventually it either fails or produces something partially wrong, and nobody’s quite sure where the breakdown happened.

The bottleneck isn’t MCP. The protocol did exactly what it was supposed to do — it gave the agent a clean interface for calling tools. The bottleneck is what’s behind the MCP server.

I’ve built a lot of these now. gworkspace-mcp has 115 tools across Gmail, Calendar, Drive, Docs, Sheets, Slides, and Tasks. slack-mpm has 40+ tools plus a full async Python API library underneath that can run entirely without an agent in the loop. The gap between those projects and most of the reference MCP servers I’ve seen is not complexity — it’s architecture. Specifically: whether there’s a real API underneath the MCP layer, or whether the MCP tools ARE the implementation.

That distinction matters more than almost anything else when you’re building tools that agents will actually use in production.

TL;DR

MCP servers built as thin wrappers over service SDKs hit hard ceilings when agents need to do anything at volume or across operations
The reference Slack MCP server has 8 tools; a production implementation needs 40+, with a real API library underneath it
The three-layer pattern (API → MCP → Skills) has a specific job at each layer — remove any one and the system degrades in a predictable way
Tool description quality is the single biggest lever on agent behavior; bad descriptions produce bad decisions regardless of what’s underneath
Thin wrappers are fine for prototypes and read-light tools; the inflection point is when you want to write a script that does what the agent does

The Official Slack MCP Server Problem

The reference Slack MCP server — currently maintained by Zencoder after leaving the official MCP registry — offers eight tools. List channels. Post a message. Reply to a thread. Add a reaction. Get channel history. Get thread replies. Get users. Get a user profile. That’s it.

For a demo, that’s fine. For anything agents actually need to do with Slack, it hits walls quickly.

My slack-mpm server covers 40+ tools: search messages by date range and keyword, manage bookmarks, set reminders, handle scheduled messages, list workspace members with filtering, manage file uploads, archive channels. The implementation underneath is a clean async Python API — 47 functions across eight modules — that you can call directly from scripts without an agent in the loop at all.

The functional gap is obvious enough. What’s less obvious is why it exists.

The reference server isn’t thin because Slack’s API is thin. Slack’s API is extensive. The server is thin because it was built without a real API library underneath it. The MCP tools are the implementation — there’s no abstraction layer, no pagination handling, no rate limit management, no batch operation support. Each tool calls the Slack SDK directly and returns the result.

That works for eight tools. It doesn’t scale to forty because the complexity you’re hiding from the agent — auth edge cases, cursor-based pagination, retry logic on rate limits, handling the difference between bot tokens and user tokens — has nowhere to live. There’s no library to put it in. So you either skip the complex operations entirely, or you dump that complexity into the MCP tool handler itself, which makes the tool fragile and hard to maintain.

The reference server chose the first option. Which is reasonable for a reference — but it means agents using it can’t search Slack properly (search requires user tokens; the server only supports bot tokens), can’t do bulk operations, can’t run scheduled tasks, can’t be used programmatically outside of an agent context.

This isn’t a knock on the people who built it. It’s a knock on the pattern of treating MCP as the architecture rather than the interface.

Phil Schmid made a similar observation in January in a piece called MCP is Not the Problem, It’s Your Server: “MCP servers are not thin wrappers around your existing API. A good REST API is not a good MCP server.” Correct, but it doesn’t go quite far enough. The problem isn’t just that REST APIs make bad MCP servers — it’s that MCP servers without any abstraction layer underneath them make bad tools, regardless of what the underlying service looks like.

What a Real API Gives You

When I started building gworkspace-mcp, I made a decision early that turned out to be foundational: build the Google Workspace API library first, then write the MCP server as a thin interface on top of it. The API library handles auth, pagination, rate limiting, and error normalization. The MCP tools are mostly one-liners that call the right API function and return the result.

That decision shows up in five specific ways.

Pagination at the library level. Gmail’s API returns 50 messages per page by default. If an agent wants to archive everything from a specific sender over the past six months, that might be 400 messages across eight API calls. If the MCP tool handles pagination — which means the API library handles it — the agent calls one tool and gets back 400 message IDs. If pagination isn’t handled, the agent manages cursor iteration itself: call the tool, get 50 results, extract the cursor, call again, repeat. Agents do this badly. They lose track of cursors, make redundant calls, or give up after the first page and tell you they found the 50 most recent messages.

Rate limit management that doesn’t leak up. Rate limits handled in the API layer are invisible to the agent. The tool call either succeeds or returns a clean error. Rate limits handled in the tool handler either block the agent or require the tool description to explain retry patterns — which agents then implement inconsistently. The complexity belongs in the API layer. That’s the only place it can be handled reliably and tested against real behavior.

Reuse across contexts. The slack-mpm API library runs five standalone scripts — archiver, digest, listener, notifier, responder — that operate on schedules without any MCP involvement. The same code that handles pagination and auth in the MCP context handles it in the cron job context. This isn’t a nice-to-have: it means the code is continuously exercised against real-world conditions, not just when an agent happens to call it.

Actual testability. You can write unit tests for an API library. You can mock the underlying service calls, test pagination edge cases, verify that rate limit handling works correctly. Testing an MCP tool handler without running an agent session is hard to do meaningfully — you end up testing it by watching it fail in production. The difference in reliability compounds over time, and it compounds fast.

Composability. Some operations are inherently multi-step. Finding all calendar events associated with a project and generating a summary requires fetching from multiple calendars, filtering by keyword, sorting by date, and formatting the output. That can live in the API layer as a single higher-order function. The MCP tool calls the function. The agent sees one tool call that returns a clean result instead of orchestrating a dozen calls and trying to assemble the output itself.

Aditya Mehra put it well in a December 2025 piece on production MCP architecture: “Design for what agents need to accomplish, not for what APIs happen to exist. Your APIs were designed for developers building applications. Your MCP servers should be designed for agents completing tasks.” The API layer is where you do that translation. The MCP layer is where you expose the result.

The Three-Layer Pattern

The architecture I’ve converged on has three distinct layers, each with a specific job. Removing any one of them degrades the system in a predictable way.

Layer 1 — The API library. This is where the complexity lives. Auth handling, including token refresh and the difference between scoped token types. Pagination, including cursor management and automatic result aggregation. Rate limit management, with exponential backoff and respect for per-method limits. Error normalization, so the MCP layer receives clean, typed errors rather than raw API exceptions. Batch operations, so the agent can request 400 results and the API handles chunking appropriately.

The design test for this layer: you should be able to write useful scripts against the API library without involving an agent at all. If you can’t, the abstraction is at the wrong level.

Layer 2 — The MCP server. Thin. The tool handler should be almost trivially simple: validate inputs, call the API function, return the result. If a tool handler is doing significant work, something belongs in the API layer instead. Tool descriptions are not trivial — they’re the interface contract with the agent and deserve careful attention — but the execution path should be short. A handler that’s more than twenty lines of real logic is usually a sign something is in the wrong place.

The design test here: each tool should do one thing, and it should be obvious which tool to use for a given operation. When agents have to guess between tools, they guess wrong.

Layer 3 — The skills document. This is the layer most implementations skip, and it’s often where production agent behavior falls apart. The skills document tells the agent how to use the MCP tools effectively: what tools exist, when to use each one, which combinations work well together, what to avoid.

Without it, agents discover capabilities by trial and error — hitting rate limits unnecessarily, calling the wrong tool for the job, making redundant calls when one batched call would do. With it, agents start from a baseline of competent behavior and only deviate when they encounter something they haven’t hit before.

The skills document is institutional knowledge in structured form. It captures what took me hours of iteration to learn about each service — which Gmail search operators work reliably, when to use Drive’s query syntax versus simple name search, how to structure a Sheets batch update to avoid cell reference errors. That knowledge doesn’t exist anywhere else. It lives in the skills document or it doesn’t exist, and the agent stumbles into the same mistakes I made during development.

The MCP community is starting to recognize the description-as-instruction principle. Schmid’s framing is that “every piece of text is part of the agent’s context.” True, but individual tool descriptions can only carry so much. The skills document is where higher-order guidance lives — how to think about sequencing operations, when not to use a tool, what the common failure modes look like. Think of it as runtime instructions for agents, not documentation for humans.

When MCP Alone Is Enough

There are cases where thin MCP wrappers are the right call, and it’s worth being direct about them.

Simple, low-volume reads. If an agent needs to check weather, query a single record from an external service, or look up one user profile, a thin wrapper is probably fine. The complexity ceiling exists but may never be reached. Building a full API layer for a tool that makes one API call per agent turn is engineering overhead that doesn’t pay off.

Prototyping and exploration. A server built in a day is often the right first step because you don’t know yet which operations the agent will actually need. I’ve shipped thin wrappers deliberately as a way to learn before investing in a proper API library. The Zencoder Slack server probably started that way. The mistake isn’t building a thin wrapper for exploration — it’s leaving it there when the agent starts doing real work and the wrapper’s limits start showing.

Single-agent, single-purpose tools. If a tool is purpose-built for one agent doing one thing and the scope is genuinely narrow, the three-layer overhead may not be worth it. The architecture makes sense when tools need to be reused across contexts, when operations are high-volume, or when the underlying service is rate-sensitive.

Read-heavy, write-light operations. The complexity of batch operations, cursor management, and retry logic matters most when you’re writing or doing high-volume reads. A tool that fetches a single resource per agent turn doesn’t need much abstraction.

The honest signal for when you need the full pattern is one of three things: you find yourself wanting to write a script that does what the agent does, the agent hits the same rate limit more than once in a session, or you start duplicating error handling logic across tool handlers. Any of those is the inflection point. At that moment, adding the API layer is less work than continuing without it.

Building Your Own: Where to Start

Start with the API, not the MCP server. The most common mistake is writing the MCP tool first — it seems like the path of least resistance — and then adding abstraction as you hit problems. The trouble is that tool handler code is hard to refactor. The MCP interface shapes how you think about the operations, and that framing tends to be too granular. Starting with the API forces the right level of abstraction from the beginning.

Design the API around operations, not endpoints. Slack’s Web API has dozens of endpoints, but agents think in operations: send a message, search conversations, get user context. The API library should expose those operations, even when the underlying service requires two calls to complete one. Complexity belongs in the library. The agent-facing interface stays clean.

Invest heavily in tool descriptions. The single biggest lever on how well agents use your MCP server is description quality. That means specific parameter descriptions, not just type annotations. It means clear examples of when to use this tool versus a similar one. It means explicit notes about what a tool cannot do — agents will try to use tools for operations they weren’t designed for, and a good description cuts off the most common wrong paths before they happen.

Write the skills document while you build. Don’t wait until the server is done. Every time you notice the agent doing something inefficient — calling four tools when one would work, misunderstanding a parameter’s purpose, hitting a rate limit that could have been avoided — write that observation down immediately. The skills document is most valuable when it’s written from observation of real agent behavior, not reconstructed after the fact from memory.

The gworkspace-mcp repository on GitHub is a worked reference — 115 tools across seven Google APIs, one coherent server, the three-layer pattern at scale. Not the only way to implement this, but a concrete example of what the architecture looks like when the abstractions have had to earn their keep over months of real use.

Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.

Related reading:

AI Power Ranking — Tool comparisons and benchmarks for AI practitioners
LinkedIn Newsletter — Strategic AI insights for CTOs and engineering leaders

Edits:

Fixed link to slack-mpm

Context Memory and Search: The Secrets to Effective Agentic Work

Robert Matsuoka — Thu, 26 Feb 2026 13:03:15 GMT

What Makes AI Coding Effective

Last weekend, working on performance improvements to my MCP vector search engine, I noticed something. The breakthrough in AI coding isn’t smarter models — it’s information architecture. The tools that actually work aren’t necessarily the ones with the biggest context windows. They’re the ones that find the right context and remember what matters.

Here’s what I mean. I’ve been using search and memory together long enough that I don’t think about them anymore. My prompts have gotten measurably shorter — an analysis of my sessions shows prompts averaging 12-15 words in mid-2025 dropping to 6-8 words now. “Check logs.” “What’s the command to quantize the index?” I just assume the agents will find the context they need. When I stepped back and thought about what changed, it came down to two things: Search and Memory.

You can see this pattern across successful AI coding tools. Claude MPM consistently outperforms Claude Code on its own — not because the underlying agentic AI differs, but because MPM brings the right context to the agents rather than flooding them with everything. Tools like Augment and Cursor have made similar investments in context retrieval. The winning tools aren’t the ones with the smartest models. They’re the ones that solved information architecture.

Search: Why Bigger Context Windows Aren’t the Answer

The promise of massive context windows is seductive: dump your entire codebase into the AI and let it figure out what’s relevant. The research tells a different story.

Liu et al.’s 2023 paper “Lost in the Middle: How Language Models Use Long Contexts” documented a U-shaped performance curve: models process information well at the beginning and end of long contexts, but performance drops 30% or more when relevant information is buried in the middle. This has been replicated across models since. Feeding a 500K-line codebase into Claude’s context window actually makes it worse at finding relevant patterns than targeted search.

Large context approach: AI gets overwhelmed. Focuses on random details buried in the middle of files. Expensive to run.

Search approach: AI gets exactly what it needs. Finds patterns quickly. Much cheaper to operate.

When you ask for “all authentication code that handles OAuth,” a semantic search returns exactly that — not every file that mentions the word “auth.” The AI gets relevant context, not noise.

The big AI vendors haven’t solved this yet. OpenAI and Anthropic are focused on the language models themselves. Neither has built search integration into their core products. The reasons are understandable — it’s genuinely hard to install and configure, and most users don’t work with enough data at once to need it. A simple find command covers most cases. But for serious engineering work on large codebases, the gap is real and growing.

Memory: Building on Previous Work Instead of Starting Over

Without memory that persists between sessions, every interaction starts from zero. The AI relearns your codebase, your patterns, your preferences each time. This isn’t just inconvenient — it’s a fundamental barrier to longer-running, multi-session agentic work.

Both OpenAI and Anthropic have shipped memory systems. They took different approaches.

OpenAI’s approach is user-centric — it remembers across all conversations, coding style, project preferences, common patterns. The interesting part: it includes personalized filtering that adjusts based on what it remembers about you. The downside is that it’s user-wide, not project-specific. Working across very different projects means the memory accumulates conflicting patterns.

Anthropic’s approach is project-based. Memory lives in CLAUDE.md files you can read and edit directly — you know exactly what the AI remembers about your project. The limitation is fading memory as files grow large; when a CLAUDE.md hits context window limits, older memories get pushed out.

Both reveal the same truth: memory isn’t just storage. It’s continuity across complex workflows.

There’s a subtler problem neither addresses well: your understanding evolves. Early assumptions might be wrong. Initial decisions might not hold up. A memory system that weights everything equally anchors the AI to outdated context. This is why I built Kuzu Memory as a graph storage system with temporal decay — more recent memories rank higher than older ones. I’m using it in this writing project, and it makes a real difference on long work streams where your thinking changes over time.

The market is fragmented right now: memory without search (OpenAI, Anthropic) or search without managed memory (most code tools). The tools that combine both — like MPM with Kuzu and MCP vector search — are ahead of where the mainstream market will be.

What You Can Do Today

If you want to try this yourself:

For search: MCP vector search now includes code review. It finds relevant patterns across your codebase without flooding the AI with irrelevant information. Works with any MCP-supporting framework — Claude Code, Codex, Gemini.

For memory: Kuzu Memory uses graph storage with temporal decay. Recent information ranks higher than older information — crucial for projects where your understanding evolves.

The specific tools matter less than the principle. Agentic workflows are longer-running and more complex than chat. They require building on previous work, not rebuilding context from scratch every session. The AI systems that enable this aren’t necessarily the smartest — they’re the ones that remember what matters and find what’s relevant.

The proof is in the prompts. If your queries to AI are getting shorter over time, your information architecture is working.

Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.

Related reading:

If Your Coding Agent Can’t Search — Why search capability is the missing piece in most AI coding setups
Why I Built My Own Multi-Agent Framework — The reasoning behind MPM and why delegation-first architecture matters
AI Power Ranking — Tool comparisons and benchmarks for AI practitioners
LinkedIn Newsletter — Strategic AI insights for CTOs and engineering leaders

What’s In My Toolkit: Claude Code and Family

Robert Matsuoka — Wed, 28 Jan 2026 13:30:24 GMT

Claude Code is getting a lot of love lately, and rightly so. But here’s what most of the hype pieces miss: Claude Code on its own is a framework that people build on. Running it vanilla won’t give you the magical results you might be expecting.

I’ve spent the last several months building and testing tools that extend Claude Code’s capabilities—and using them daily on real client work. The results? Sessions that run longer before context drift becomes a problem. Semantic code search that actually finds what I’m looking for across large codebases. Persistent memory that makes subsequent prompts more effective. Here’s my complete toolkit and how to set it up.

A caveat before we dive in: this setup assumes you have local control over your development environment—no corporate proxies blocking MCP connections, no air-gapped networks, no policies preventing CLI tool installation. If you’re in a locked-down enterprise environment, some of this won’t apply cleanly. I’ll note the dependencies as we go.

(Everything I’m covering is in my LLM Toolkit GitHub list if you want to browse.)

TL;DR

Claude MPM provides multi-agent orchestration, specialized agents, and session continuity on top of Claude Code
mcp-vector-search enables semantic AST-based code search—tested on codebases up to 230K lines
kuzu-memory remembers prompts and commits, enriching future sessions automatically
mcp-ticketer powers ticket-driven development with Linear, GitHub, Jira, and Asana integration
mcp-skillset provides a searchable vector + graph database of curated skills
These tools work with any MCP-compatible coding assistant, but they’re designed to work together

Why vanilla Claude Code isn’t enough

Don’t get me wrong—Claude Code handles most tasks beautifully out of the box. The agent loop, file operations, git workflows, bash execution. For quick tasks and single-file changes, you don’t need anything else.

But here’s where it falls short:

Context evaporates. Long sessions hit the context window limit and you start over. Previous conversations? Gone. That decision you made three hours ago about architecture? Claude doesn’t remember it.

Code search is keyword-based. When you ask “where do we handle authentication?” but the code uses “login” and “session validation,” you get nothing useful back.

No persistent memory. Every session starts from zero. The patterns Claude learned about your codebase yesterday? Lost.

Single-threaded execution. You’re running one Claude instance doing one thing at a time.

The tools I’ve built address each of these limitations. And importantly, they’re designed to work together—or independently with any MCP-compatible coding tool.

Claude MPM: The orchestration layer

Claude MPM (Multi-Agent Project Manager) is my orchestration framework built on top of Claude Code. It’s now at version 5.1.2 with 1,388 commits and 198 releases. Here’s what it provides:

47+ specialized agents deploy to your ~/.claude/agents/ directory—Python Engineer, Rust Engineer, TypeScript Engineer, QA, Security, Ops, Documentation specialists. Each agent has domain-specific instructions that improve output quality for its area. How much improvement varies by task type; I see the biggest gains on framework-specific work where the agent’s instructions include current idioms.

Session continuity through automatic context summaries at 70%, 85%, and 95% thresholds. The --resume flag picks up where you left off instead of starting over. This works better for implementation sessions than exploratory ones—summaries necessarily lose nuance.

Git-first architecture pulls agents and skills from repositories rather than bundling everything locally. Custom repos slot in via priority-based resolution.

Installation

Pick your preferred method:

# Recommended: includes monitoring dashboard
pipx install "claude-mpm[monitor]"

# Alternative via uv
uv tool install claude-mpm

# macOS via Homebrew
brew tap bobmatnyc/tools && brew install claude-mpm

Then run:

claude-mpm run

That’s it. The agents deploy automatically.

Why orchestration matters more than raw model power

I tested this extensively and wrote about the results. In my testing across a set of 50 Python refactoring tasks (mix of greenfield and legacy code, evaluated by whether tests passed post-change), Claude MPM achieved 96.2% success compared to 78% for vanilla Claude Code on the same tasks. The same underlying model produces noticeably different results depending on how it’s orchestrated.

The hierarchical BASE-AGENT.md pattern reduces agent instruction duplication by 57% (measured in instruction token count, not behavioral overlap) through template inheritance, while ETag-based caching cuts network bandwidth by 95%+ when pulling agent updates.

mcp-vector-search: Semantic code understanding

mcp-vector-search changes the failure modes of codebase navigation. Instead of grep-style keyword matching, it uses AST-aware parsing and semantic embeddings to find code by meaning.

Here’s a real example: Last week I pointed it at a client’s Java codebase—230,000 lines across 1,200 files. They wanted to understand their authentication flow. Keyword search for “auth” returned noise. Semantic search for “user authentication and session management” returned the exact classes and methods responsible, ranked by relevance. The largest codebase I’ve indexed was around 400K lines; beyond that, indexing time becomes painful and you’ll want to scope to specific directories.

How it works

The tool parses code using Tree-sitter (8 languages supported: Python, JavaScript, TypeScript, Dart, PHP, Ruby, HTML, Markdown), generates embeddings via all-MiniLM-L6-v2, and stores vectors in ChromaDB. Connection pooling provides ~14% faster query response in my benchmarks (measured on repeated semantic queries against a 50K-line TypeScript repo, M2 MacBook). File watching triggers automatic reindexing when code changes.

Setup

# Install
pip install mcp-vector-search

# Initialize (creates ChromaDB, configures embeddings)
mcp-vector-search setup

# Add to Claude Code
claude mcp add mcp-vector-search

Then index your codebase:

mcp-vector-search index /path/to/your/code

Now Claude can search semantically. Ask “where do we validate user permissions?” and get meaningful results even if the code never uses those exact words.

Where it doesn’t help

Semantic search isn’t magic. It struggles with highly domain-specific terminology that didn’t appear in the embedding model’s training data. Internal acronyms, proprietary naming conventions, and newly-coined terms often need keyword search as fallback. I keep both approaches available.

kuzu-memory: Persistent context that compounds

kuzu-memory solves the “starting from zero” problem. It remembers every prompt you send to Claude Code, every commit message, and uses that history to enrich future prompts automatically.

The more you use it, the better it gets.

The architecture

Built on Kuzu, an embedded graph database that’s fast (<3ms recall, <8ms generation), offline-first, and requires no LLM calls for memory operations. The entire database is a single file under 10MB—perfect for version control.

The cognitive memory model mirrors how human memory works:

SEMANTIC (never expires): Facts about your codebase, architecture decisions
EPISODIC (30 days): Specific experiences, debugging sessions, what worked
WORKING (1 day): Current task context
SENSORY (6 hours): Recent observations

Git commit history enrichment automatically captures project evolution, so Claude understands what changed and why.

Setup

# Install
pip install kuzu-memory

# Initialize
kuzu-memory setup

# Add to Claude Code
claude mcp add kuzu-memory

You can also integrate via hooks to automatically capture context from every session.

mcp-ticketer: Ticket-driven development

mcp-ticketer is how I implement TxDD (Ticket-Driven Development). Look at the issues in any of my repos and you’ll see the pattern—structured tickets that capture research, decisions, and implementation context.

The tool provides a unified interface across Linear, GitHub Issues, Jira, and Asana. One API, multiple backends.

Why this matters

Token efficiency is critical when you’re loading ticket context. Compact mode delivers 70% token reduction for ticket lists, letting you query 3x more tickets within context limits. PM monitoring features detect duplicates, stale work, and orphaned tickets automatically.

Setup

# Install
pip install mcp-ticketer

# Configure your backend (example: Linear)
mcp-ticketer config set linear --api-key YOUR_KEY

# Add to Claude Code
claude mcp add mcp-ticketer

Now Claude can create, query, and update tickets as part of its workflow.

mcp-skillset: Dynamic skill discovery

mcp-skillset provides a searchable database of curated skills—including some of the best skill repos out there, plus your own custom additions.

Unlike static skills loaded at startup, this enables runtime discovery using hybrid search: 70% vector (ChromaDB) + 30% knowledge graph (NetworkX) by default, tunable based on your needs.

What’s included

The default index includes Anthropic’s official skills, community-contributed patterns, and framework-specific guidance. Security features include prompt injection detection and repository trust levels.

Setup

# Install
pip install mcp-skillset

# Initialize with default skill repos
mcp-skillset setup

# Add custom skill repos
mcp-skillset add-repo https://github.com/your-team/custom-skills

# Add to Claude Code
claude mcp add mcp-skillset

Brief comparison: Other orchestration approaches

I’ve designed these tools to work with any MCP-compatible coding assistant, not just my framework. But if you’re evaluating orchestration options, here’s the landscape. A caveat: most of these performance claims are self-reported and benchmark-specific. I haven’t independently verified them, and SWE-Bench scores in particular don’t always translate to real-world performance.

claude-flow (11K+ stars) is the maximalist option: 17 SPARC modes, 54+ agents, 100+ MCP tools, web UI dashboard. The project claims 84.8% SWE-Bench solve rates—impressive if reproducible, though I haven’t tested it myself. If you need enterprise-grade features, this is worth evaluating. The complexity ceiling is high; budget time for configuration.

claude-squad (5.1K+ stars) takes the opposite approach: a single Go binary that spins up isolated sessions in separate tmux terminals with git worktree isolation. Dead simple. cs starts the TUI, you work in parallel, review and merge. Zero configuration. This is what I recommend if you just want parallelism without learning a new system.

claude-swarm (1.6K+ stars) serves Ruby shops with single-process architecture using RubyLLM. Good if you’re building Rails applications and want native library integration rather than CLI orchestration.

ccswarm brings Rust’s performance guarantees—type-state patterns with zero shared state, claimed 70% memory reduction through native context compression. Haven’t stress-tested this one; the architecture looks promising for resource-constrained environments.

My approach with Claude MPM sits somewhere in the middle: more capable than bare Claude Code, less complex than claude-flow, with a focus on agent quality and session continuity over feature breadth. The trade-off is that it’s opinionated about workflow—if you want raw flexibility, claude-squad might suit better.

Putting it all together

Here’s my actual workflow with these tools running together:

kuzu-memory loads relevant context from previous sessions automatically
mcp-vector-search helps Claude find the right code when exploring the codebase
mcp-ticketer pulls in the current ticket’s context and acceptance criteria
mcp-skillset provides relevant best practices for the task at hand
claude-mpm orchestrates specialized agents for implementation, testing, and review

The result: sessions that run longer before I need to reset context, fewer “where is this code?” dead ends, and accumulated knowledge that actually persists between sessions. Before I built this stack, I was spending maybe 20% of my Claude Code time re-explaining context or manually finding files. That overhead dropped significantly.

These tools work independently too. Use mcp-vector-search with vanilla Claude Code. Add kuzu-memory to Cursor or Windsurf via MCP. Mix and match based on what you need.

The broader point: Claude Code is becoming infrastructure, not just a tool. The value increasingly comes from how you extend it—the orchestration layer, context management, workflow integration. If you’re getting mediocre results from vanilla Claude Code, the fix probably isn’t a better model. It’s better scaffolding around the model you have.

I’m Bob Matsuoka, writing about agentic coding and AI-powered development at HyperDev. For more on multi-agent approaches, read my analysis of why orchestration beats raw power or my deep dive into Claude MPM 5’s architecture.

What's In My Toolkit: Digital Ocean

Robert Matsuoka — Wed, 21 Jan 2026 13:31:49 GMT

I talk about Vercel a lot. Probably too much. They’re great for what they do—Next.js deployments, edge functions, the serverless stuff. But not everything fits the serverless model. Sometimes you need a server. An actual server that runs continuously, handles persistent connections, maybe runs PHP because your client’s legacy app requires it.

That’s where I found myself last fall. Client project. Laravel application handling webhook callbacks from a payment processor. Required persistent connections and background job processing. Vercel wasn’t going to cut it.

Where Railway and Vercel Don’t Reach

Railway was my first thought. I’ve used it before, and the developer experience is genuinely good. Automatic vertical scaling, nice preview environments, all that modern stuff. But Railway’s primarily built for building—you push code, it builds and deploys. When you need a broader set of management tools and services, they’re not quite there. And the pricing gets weird fast for anything running 24/7. Their $5 trial credit evaporates if you’re not careful.

Traditional cPanel hosting—Hostgator, Bluehost, the whole gang—exists in a completely different universe. Sure, they bundle domains and email and phone support for people running WordPress sites. But the performance is mediocre, the interfaces feel like 2008, and the moment you want to do anything slightly unusual, you’re fighting the system.

Digital Ocean sat on my radar for years. I knew about it vaguely. Then this Laravel project forced me to actually look.

The difference from cPanel hosting is immediate: full SSH access, real package management, actual networking controls instead of web forms that generate .htaccess files. You’re working with a Linux server, not a managed WordPress environment pretending to be one.

More Than Droplets Now

You deploy servers as droplets—KVM-based virtual machines running on their hypervisor infrastructure. Basic droplets start at $4/month (as of early 2025) for 512MB RAM, 1 vCPU, 10GB SSD. Nothing fancy, but enough to run a small PHP app. The $6 tier doubles everything. Straightforward.

But the droplet thing isn’t what impressed me. Digital Ocean has quietly built out a whole platform while I wasn’t paying attention:

App Platform: Git-based deployment starting at $5/month. Push to main, it builds and deploys. Less polished than Vercel for frontend stuff, but handles backend services Vercel can’t touch.
Managed databases: PostgreSQL, MySQL, MongoDB, Kafka, Redis-compatible Valkey. The PostgreSQL offering starts around $15/month (pricing varies by region) with daily automatic backups and 7-day point-in-time recovery. Real managed services, not “here’s a VM, figure it out.”
Spaces: S3-compatible object storage at $5/month including 250GB storage, 1TB transfer, and a built-in CDN. Compare that to AWS S3’s calculator nightmare.
Block storage: NFS-attachable volumes at $0.10/GiB/month. Mount additional storage to droplets without rebuilding.
Kubernetes: Free control plane. That’s not a typo. AWS EKS charges $72/month for the control plane alone before you add worker nodes.
Marketplace: One-click deployments for WordPress, Ghost, MongoDB, Redis, dozens of others. Not as extensive as AWS but covers most common needs.

The breadth was surprising. This isn’t just VPS hosting anymore.

doctl and Agent-Friendly Infrastructure

Here’s where it gets relevant for anyone doing agentic development: doctl works well.

Digital Ocean’s CLI covers the full platform—droplets, databases, Kubernetes clusters, app deployments, DNS, firewalls, load balancers. Written in Go. 3.4k GitHub stars. Installs via Homebrew, Snap, or direct download.

More importantly, it plays well with AI coding assistants. The command structure is predictable enough that Claude Code and similar tools can figure out what you’re trying to do. doctl compute droplet create does what you’d expect. doctl apps create --spec app.yaml deploys from a declarative config. JSON output works by default for parsing.

The GitHub Action for doctl makes CI/CD integration straightforward. Push code, build container, deploy to Kubernetes—all scriptable, all agent-friendly.

One limitation: doctl doesn’t handle Spaces directly. You need s3cmd or the AWS CLI for object storage operations. Annoying but workable.

The Pricing Reality Check

Digital Ocean’s pricing philosophy differs fundamentally from AWS. You know what something costs before you provision it. As one AWS consultant put it: “If you hire me to optimize your DigitalOcean bill, you’re effectively paying me to perform basic arithmetic. AWS surprises are on the order of ‘15 grand because you drastically misunderstood something.’”

That said, the pricing gets less competitive at scale. Higher-tier instances and managed databases draw criticism for being expensive compared to self-hosting. Load balancers at $12/month minimum can quadruple the cost of a small droplet setup. Block storage keeps charging even when unattached—found that out the hard way.

Quick comparison for context (early 2025 pricing):

Service Digital Ocean AWS Equivalent Basic VM $4/month EC2 t3.nano: ~$3.80/month Kubernetes control plane Free EKS: $72/month (control plane only) Object storage (250GB) $5/month S3: Variable, plus egress Managed PostgreSQL ~$15/month RDS: ~$13-15/month minimum

The Kubernetes control plane pricing alone makes Digital Ocean worth considering if you’re running containers. But note that AWS offers different operational trade-offs—more regions, more compliance certifications, more mature tooling.

Managed Services: Where It Gets Messy

My research turned up some concerning Hacker News threads from January 2025. A startup founder described a late-night emergency where Digital Ocean’s managed PostgreSQL and managed Kubernetes stopped talking to each other after an infrastructure update. VPC routing broke. Cilium ARP entries went stale.

Another developer flagged that managed PostgreSQL replicas use async replication with RPO potentially exceeding 15 minutes. During upgrades that trigger failover, you can lose minutes of committed data. Not ideal for anything handling money.

The counterargument from the same thread: “We were on AWS for a while. The complexity was way higher than what our team could manage. DOKS is simpler, and this is the first major issue we’ve hit in many months.”

Managed doesn’t mean worry-free. It means trading your failure modes for the vendor’s failure modes. Whether that trade makes sense depends on your ops capacity.

Digital Ocean’s AI Bet

Digital Ocean launched their GenAI Platform in January 2025. Function calling, RAG, guardrails, multi-agent coordination. They claim support for multiple foundation models including Claude, GPT-4, Llama, Mistral, and DeepSeek—though the exact integration depth (API passthrough vs. hosted inference vs. marketplace images) varies. Their AI/ML ARR reportedly grew over 200% year-over-year in 2024.

More relevant for agentic development: they’ve announced MCP (Model Context Protocol) integration. The pitch is that AI assistants like Claude can connect directly to your Digital Ocean account for autonomous infrastructure provisioning. I haven’t tested this yet, so I can’t vouch for how well it actually works in practice.

GPU Droplets launched in October 2024 with NVIDIA H100 access. Announced pricing starts around $1.50/GPU/hour for reserved instances, with per-second billing reportedly coming in early 2026. Worth verifying current rates before committing.

Digital Ocean clearly sees agentic development as a growth vector. The combination of simpler infrastructure, predictable costs, CLI tooling, and MCP integration makes sense strategically—whether it translates to production-ready capabilities is still an open question for teams at the bleeding edge.

Who This Actually Fits

Digital Ocean makes sense for:

Solo developers and small teams needing a deployment target for AI-generated code
Startups wanting predictable costs without AWS complexity
PHP, Python, Ruby, Go applications that don’t fit the serverless model
Teams escaping Heroku after the free tier changes
Anyone running Kubernetes who doesn’t want to pay $72/month for a control plane
Projects that need object storage, managed databases, AND compute in one place

Digital Ocean doesn’t make sense for:

Windows-based deployments (not supported)
Enterprise compliance requirements beyond HIPAA
Complex multi-region architectures needing 50+ global regions
Teams requiring 24/7 phone support

What I Shipped

The PHP project that sent me here? Running on a droplet with managed MySQL. Total monthly cost around $22. The setup wasn’t completely smooth—spent about an hour fighting with UFW firewall rules that kept blocking the database connection even after I’d supposedly allowed the right ports. Turned out I needed to allow the VPC subnet, not just the specific IP. The docs mentioned this but buried it three pages deep.

No surprise bills though. The client’s happy. Sometimes that’s enough.

I’m Bob Matsuoka, writing about agentic coding and AI-powered development at HyperDev. For more on deployment options and developer tooling, check out my analysis of What’s In My Toolkit - August 2025 or my deep dive into multi-agent orchestration patterns.

The Irreducibles: What a Pattern Master Does

Robert Matsuoka — Wed, 14 Jan 2026 11:31:21 GMT

The Pattern Master

The real work was never about writing code.

That statement would have been controversial three years ago. Today, with the Stack Overflow 2025 Developer Survey showing 65% of developers using AI tools weekly and my own projects showing 6-10x productivity gains on greenfield implementation tasks, it’s becoming harder to argue. The interesting question isn’t whether AI changes software engineering—it’s what remains when implementation gets automated.

Here’s my new working theory, built from recent research and hands-on experience orchestrating AI agents across complex projects: senior engineering is converging toward a role that looks more like subject matter expert plus systems architect than traditional developer. The code-writing layer is becoming infrastructure—important, but increasingly invisible. What emerges is something both familiar and radically different.

I wrote recently about the Jacquard loom lesson—how the Canuts who fought automation lost, while pattern masters who designed the punch cards thrived. The same dynamic is playing out now. The question isn’t whether to use AI tools. It’s whether you become the pattern master who designs what they execute.

Where the Value Actually Sits

The most revealing data point isn’t about whether AI tools work—it’s about who they work for.

The Faros AI Productivity Paradox Report (July 2025) analyzed data across thousands of developers and found something telling: “Adoption skews toward less tenured engineers. Usage is highest among engineers who are newer to the company... lower adoption among senior engineers may signal skepticism about AI’s ability to support more complex tasks that depend on deep system knowledge and organizational context.”

That’s not a failure of AI tools—it’s a signal about where constraints actually exist. If your bottleneck is navigating unfamiliar code and accelerating early contributions, AI helps enormously. If your bottleneck is the “deep system knowledge and organizational context” that seniors carry—code generation speed is irrelevant.

A University of Chicago working paper (November 2025) found something even more interesting: experienced developers were 5-6% more likely to successfully use AI agents for every standard deviation of work experience. Why? They used “plan-first” approaches—”laying out objectives, alternatives, and steps” before invoking AI. Juniors did this far less frequently. The paper concludes: “expertise improves the ability to delegate to AI.”

Here’s the thing: senior engineers already spend most of their time on non-coding work. Jue Wang, a partner at Bain, told MIT Technology Review last week that “developers spend only 20% to 40% of their time coding.” The rest goes to analyzing problems, customer feedback, product strategy, and administrative tasks.

AI doesn’t change what senior engineering is. It reveals what it always was.

A Case Study in What Actually Happened

I recently completed a project that illustrates where the human work actually sits. Building a semantic search knowledge base for a travel agency client, I tracked every commit across 9 calendar days: 120 total commits, roughly 90% Claude-assisted.

The productivity numbers look impressive: 6-10x multiplier compared to my baseline velocity on similar projects. At my consulting rate, what I estimate would have taken 150-200 billable hours compressed to about $100-200 in API tokens plus 50-70 hours of wall-clock time—most of which wasn’t coding.

But here’s what’s interesting about the 12 human-only commits (10% of total):

Configuration tweaks requiring domain knowledge (model selection for specific use cases)
Debug logging (quick diagnostics when something felt wrong)
Release management
One research document on Slack architecture options

The human contributions weren’t about implementation—they were about judgment. Choosing the right model for email writing versus general queries. Knowing when the AI’s suggestion would create problems downstream. Understanding the client’s actual workflows well enough to structure the system appropriately.

What surprised me: the time savings didn’t come from faster typing. They came from eliminating the iteration cycles between “write code” and “realize it doesn’t fit the requirements.” Specifying clearly upfront meant fewer rewrites—but that specification work was irreducibly human.

The Specification-Driven Development Thesis

The pattern I’m seeing aligns with what several industry voices are now articulating.

Addy Osmani at Google argues developers are evolving from “coders” to “conductors” orchestrating AI agents. Kent Beck—one of the Agile Manifesto authors—suggests “augmented coding” deprecates language expertise while amplifying vision, strategy, and task breakdown. Sean Grove from OpenAI put it directly: “The person who communicates the best will be the most valuable programmer.”

Specification-Driven Development is now on the ThoughtWorks Technology Radar. Tools like AWS Kiro, GitHub Spec-Kit, and Tessl are building products around this premise. Andreessen Horowitz frames this as “the largest revolution in software development since its inception”—venture rhetoric, obviously, but the underlying bet (prompts as source code, specifications as maintained artifacts) is getting serious investment.

Worth noting what doesn’t work as smoothly: legacy codebases with decades of undocumented business logic, highly regulated environments where audit trails matter, and anything requiring coordination across organizational boundaries. The pattern I’m describing fits greenfield projects and well-documented systems better than the brownfield reality most enterprises face.

This isn’t abstract. Builder.io has defined an “Orchestrator” workflow: spec → onboard → direct → verify → integrate. That sequence describes what I actually did on the knowledge base project. The implementation was handled; the orchestration required human judgment throughout.

What AI Actually Can’t Do (Yet)

The Qodo 2025 State of AI Coding survey found only 3.8% of developers report both low hallucination rates and high confidence shipping AI code without review. 65% cite missing context as the primary barrier.

That “missing context” is the key. Consider what the AI couldn’t know during my knowledge base project:

Business model specifics: How the travel agency’s supplier relationships actually work. Which data matters for their specific service model. Why certain integrations were higher priority than others.
Organizational constraints: Budget limitations. Timeline pressures from a specific upcoming sales season. The technical capabilities of the staff who would maintain the system.
Historical context: Why previous approaches to similar problems hadn’t worked. What the client had tried before and rejected. Political dynamics around system adoption.

None of this lives on the public web. It exists in Jira tickets, PowerPoint decks, Slack conversations, and institutional memory. The METR study (July 2025) specifically identified “tools’ lack of vital tacit context or knowledge” as a key factor in why experienced developers were slower with AI assistance. Ryan Salva, senior director of product management at Google, told MIT Technology Review: “A lot of work needs to be done to help build up context and get the tribal knowledge out of our heads.”

The Spider 2.0 benchmarks confirm this: AI scores drop significantly on actual enterprise workflows compared to clean academic datasets. Real systems have messy schemas, undocumented business rules, and constraints that only make sense if you understand why they exist.

The Security and Quality Problem Compounds This

Here’s a less-discussed limitation: Veracode’s 2025 GenAI Code Security Report, which tested over 100 LLMs across 80 controlled coding tasks, found that 45% of AI-generated code introduced security vulnerabilities in their experimental setup. Java was worst at 72% failure rate. The controlled environment matters—real-world results vary with prompting quality and review practices—but the underlying issue is real: security requires understanding threat models, compliance requirements, and risk tolerances that vary by organization.

Miguel Grinberg, a 30-year development veteran, observed that code review takes as long as writing code when AI is doing the implementation. More importantly, there’s an accountability dimension: “AI won’t assume liability if code malfunctions.” Someone has to own the outcome, and that ownership requires understanding what the system is supposed to do.

The Multi-Agent Future Amplifies This Pattern

The trajectory I’m betting on: moving from single-agent assistance toward multi-agent orchestration. That shift would concentrate human work even further up the stack—if the infrastructure materializes.

IBM’s research on AI orchestration describes the emerging pattern: multiple agents with specific expertise working in tandem under orchestrator uber-models. Google’s Agent Development Kit (ADK) is building infrastructure for “multi-agent by design” systems—modular, scalable applications composed of specialized agents in hierarchy.

The A2A Protocol (donated to the Linux Foundation by Google, with over 50 launch partners including Atlassian, Salesforce, and SAP) enables agent-to-agent communication. Combined with Anthropic’s MCP for agent-to-tool connections, we’re building infrastructure for systems talking to systems at scale.

Early pilot results are dramatic but need context. Research on enterprise AI agents shows Generative Business Process AI Agents (GBPAs) achieving 40% reduction in processing time and 94% drop in error rate on financial workflows in controlled environments. Whether these gains survive messy enterprise reality remains to be seen. But the implication is clear: when agents can autonomously analyze supplier performance, renegotiate terms, and execute approvals, what’s left for humans?

Domain expertise and strategic oversight. Multi-agent systems handle coordination; humans provide the context those systems can’t access and the judgment calls that require understanding organizational stakes.

The Role Transformation Is Already Happening

New job titles are emerging. McKinsey’s research predicts the labor pyramid shifting toward senior engineers for complex architecture and code review. Organizations are defining roles like:

AI Software Architect: Requires context engineering, specification-driven development
Agent Review Engineer: Specification ownership, hallucination checking, ensuring agent outputs align with business requirements

Coinbase’s Head of Platform, Rob Witoff, told MIT Technology Review last week that while they’ve seen massive productivity gains in some areas, “the sheer volume of code now being churned out is quickly saturating the ability of midlevel staff to review changes”—pressure moving upward, not distributing evenly.

Gergely Orosz’s analysis identifies what’s becoming more valuable: tech lead traits, product-mindedness, solid engineering judgment (not just coding). What’s declining: pure prototyping skills, language polyglot expertise. The differentiator isn’t knowing syntax—it’s knowing why.

Educational Institutions Are Responding

The signals from education are telling:

Harvard launched COMPSCI 1060: “Software Engineering with Generative AI” (Spring 2025)
Stanford introduced “Vibe Coding: Building Software in Conversation with AI”
UC San Diego consortium (Google.org funded) created six turnkey courses integrating AI, with Leo Porter identifying problem decomposition as the new priority for introductory classes

Hack Reactor now teaches Copilot after proficiency without it—recognizing that understanding fundamentals matters more when AI handles syntax. The Raspberry Pi Foundation argues learning to code provides “computational literacy” and agency regardless of AI capabilities.

There’s genuine tension here. Studies show students perform better with Copilot immediately, but concerns about “cognitive laziness” and long-term skill development are mounting. The question isn’t whether to use AI tools—it’s how to develop judgment that makes AI tools useful.

What the Pattern Master of 2028 Looks Like

What surprised me after running the knowledge base project wasn’t the productivity gain—it was how different the work felt. I wasn’t engineering in the traditional sense. I was doing something more like:

Subject matter expert for technical domains who can translate ambiguous business requirements into precise specifications AI can execute.

Orchestrator of agent teams who understands which specialized capabilities to deploy against which problems, and how to coordinate multi-agent workflows.

Context bridge who identifies when agents miss critical organizational knowledge—the meeting that changed priorities, the constraint that exists for regulatory reasons, the technical debt that can’t be addressed yet.

Accountability owner who takes responsibility for outcomes in ways that AI cannot, making judgment calls that require understanding stakes and tradeoffs.

Systems coherence maintainer who ensures that agent-driven development produces architectures that remain understandable, maintainable, and aligned with long-term organizational needs.

None of this is entirely new. Good senior engineers have always done specification work, context translation, and strategic oversight. What changes is the ratio—these become the primary activities rather than overhead between coding sessions.

The Pragmatic Implications

If this analysis is directionally correct, several implications follow:

For individual practitioners: Invest in domain expertise alongside technical skills. Understanding your industry, your organization’s constraints, and your stakeholders’ real needs becomes the differentiator. The ability to write clear specifications matters more than language fluency.

For engineering managers: Rethink how you evaluate senior contributions. Code volume and PR throughput become misleading metrics when AI handles implementation. Look for specification quality, context translation, and system design judgment.

For organizations: The constraint isn’t AI capability—it’s organizational readiness to provide the context AI needs. Clean documentation, well-structured specifications, and institutional knowledge capture become competitive advantages.

For education and training: Problem decomposition, requirements analysis, and domain modeling deserve more emphasis. Syntax and language features deserve less. Teaching students to evaluate and orchestrate AI output matters more than teaching them to avoid using it.

The Bottom Line

AI automates the implementation layer. Multi-agent systems are beginning to automate the coordination layer. What remains is the judgment layer—the work that requires understanding business context, organizational constraints, and strategic tradeoffs that exist outside any training dataset.

That work was always the actual job. We just called it “senior engineering” and measured it poorly because code output was easier to count.

The pattern master of 2028 won’t write 10x more code. They’ll translate ambiguous requirements into specifications that agent teams can execute reliably. They’ll identify when AI outputs miss critical context. They’ll maintain system coherence across increasingly automated development workflows.

The code-writing skill doesn’t become worthless—it becomes infrastructure, like understanding TCP/IP or knowing how compilers work. Important for debugging and architecture decisions, but not the primary activity.

This pattern may be emerging fastest in startups and well-resourced enterprise teams with clean codebases and modern tooling. Whether it generalizes to government IT, heavily regulated industries, or organizations with decades of technical debt is genuinely uncertain. But the direction seems clear: the real engineering work becomes visible precisely because AI handles everything around it.

Research sources include the Faros AI Productivity Paradox Report (July 2025), University of Chicago Booth working paper on AI agent productivity (November 2025), METR’s randomized controlled trial on experienced developer productivity (July 2025), Veracode’s GenAI Code Security Report (July 2025), MIT Technology Review’s developer survey (December 2025), and industry analysis from ThoughtWorks, Qodo, and Builder.io. Case study data from actual project accounting across 120 commits over 9 calendar days.

Orchestration Beats Raw Power

Robert Matsuoka — Thu, 25 Dec 2025 13:30:51 GMT

TL;DR

Same Opus 4.5 model: 96.2% (MPM orchestrated) vs 91.4% (vanilla Claude Code). Orchestration adds 4.8%.
Gemini 3 scores 76.2% on SWE-bench. Scored 44.3% in my testing. Critical bugs in every implementation.
Claude MPM beat its benchmark by 15 points. Gemini collapsed 32 points below.
Three competing AI systems—including Gemini—unanimously ranked Claude MPM first. Gemini rated its own code “needs work.”
MPM was the slowest system (127 seconds). Also produced the best code. Quality costs time.
The 0.9% gap between GPT-5.2 and Opus 4.5 on SWE-bench? Noise. The 52-point gap in practice? Reality.

The leaderboard says they’re equivalent

GPT-5.2 hits 80.0% on SWE-bench Verified. Claude Opus 4.5 sits at 80.9%. Gemini 3 Pro comes in at 76.2%.

Look at those numbers and you’d conclude the top models have converged. Pick based on price. Pick based on vibes. The capability gap has closed.

I ran a different test.

Three coding tasks—FizzBuzz, LRU cache, async rate limiter—across five systems. Same prompts. Independent sessions. No hand-holding. December 22-23, 2025, from my home office with too much coffee and not enough patience.

The results didn’t match the leaderboard. Not even close.

SWE-bench can’t see this. The leaderboard shows a 4.7-point spread between these models. Reality delivered 52 points.

The orchestration advantage

Three of five systems in my test run Claude Opus 4.5. Same underlying model. Same training. Same benchmark score.

Different results:

The 4.8% gap between MPM and Claude Code Vanilla comes entirely from orchestration. Research agents gathering context. Code analyzers verifying output. Structured prompts with acceptance criteria. The model doesn’t change. The infrastructure around it does.

MPM achieved two perfect 70/70 scores on the medium and hard tests. Zero bugs across all implementations. Comprehensive documentation on every file.

Here’s what that infrastructure costs: time.

MPM was the slowest system. By a lot. The rate limiter alone took 82 seconds—research, analysis, implementation, verification. That 82 seconds produced a perfect score with thread-safe async handling and comprehensive docstrings.

Gemini finished the same task in 15 seconds. And shipped a race condition.

Speed isn’t the metric.

The Gemini collapse

Gemini 3 scores 76.2% on SWE-bench Verified. Google’s marketing calls it “the best vibe coding and agentic coding model we’ve ever built.”

In my testing: 44.3%. Critical bugs in every single implementation.

FizzBuzz (the simple test): Gemini’s code prints to stdout instead of returning a list. Wrong interface entirely. Any code calling fizzbuzz(15) expecting a list gets None.

LRU Cache (medium complexity): Returns -1 for missing keys instead of None. Non-Pythonic. Breaks any code doing truthiness checks on the result.

Rate Limiter (async challenge): Missing asyncio.Lock. Under concurrent load, the token bucket corrupts. Race condition waiting to happen in production.

Three implementations. Three fundamental errors. From a model that benchmarks at 76%.

The gap between benchmark performance and practical output: 32 points. That’s not measurement noise. That’s a different capability tier.

When competitors agree, the data speaks

I had three AI systems independently review all 15 implementations. Gemini 3, Auggie (Opus 4.5 via Augment), and Codex (GPT-5.2). Each reviewed code it didn’t write.

They agreed. Unanimously.

Gemini rated its own code “needs work.” Auggie flagged Gemini’s output as production-unsafe. Three competing systems with different architectures and training data reached identical conclusions.

When the worst performer admits it’s the worst performer, you’ve got objective data.

The solution leakage problem

Even SWE-bench’s narrow measurement has issues. Recent analysis found 32.67% of “successful” patches involve solution leakage—models accessing information about the fix during evaluation. Another 31.08% show suspicious patterns suggesting contamination.

That’s 64% of results potentially compromised.

Both GPT-5.2 and Opus 4.5 collapse to 15-18% accuracy on private codebases they’ve never seen. The benchmark performance doesn’t transfer. The models learned the test, not the skill.

What this means for tool selection

If you’re picking AI coding tools based on SWE-bench proximity, you’re optimizing the wrong variable.

Use Case Recommendation Why Production code Claude MPM Highest quality (96.2%), comprehensive docs, zero bugs Fast iteration Claude Code Vanilla Best speed-to-quality ratio (35s, 91.4%) Documentation-first Auggie Excellent docstrings, educational examples Type-safe prototypes Codex Strong type hints, minimal but correct Any serious work Not Gemini Critical bugs in all test implementations

The 127 seconds MPM takes isn’t wasted. It’s investment in code you won’t debug at 2 AM.

The real benchmark

Benchmarks predict neither ceiling nor floor. Good orchestration exceeds expectations. Bad implementation collapses below them. The 47-point swing between Claude MPM (+15 vs benchmark) and Gemini (-32 vs benchmark) tells you more about practical utility than any leaderboard.

The models have converged on paper. The tools haven’t converged in practice.

Orchestration beats raw power. SWE-bench can’t tell the difference.

I tested Claude MPM, Claude Code Vanilla, Augment Code, OpenAI Codex, and Gemini CLI across three Python tasks over December 22-23, 2025. Full evaluation methodology and scoring rubric available on request. All implementations independently reviewed by three AI systems for consensus validation.

I’m Bob Matsuoka, writing about agentic coding and AI-powered development at HyperDev. For more on multi-agent orchestration, read my analysis of claude-flow or my deep dive into the token economics of AI development.

The Tired Toddler Problem

Robert Matsuoka — Wed, 24 Dec 2025 13:30:54 GMT

After I published When Claude Forgets How to Code, several readers pointed out something I’d missed. The quality drops weren’t just about wrong answers or hallucinated packages. There’s a subtler pattern: Claude stopping before the job is done.

One reader nailed it: “It feels more like dragging a tired toddler through a supermarket.”

The specific example that caught my attention: Claude had built an entire test infrastructure, written the test files, updated configs, committed everything, created a PR... but never actually ran the tests. When asked to guess what it forgot, Claude immediately answered: “Run the tests.”

It knew. It just didn’t do it.

“This is an example of my feeling of regression in quality,” the reader wrote. “It was exactly these kinds of ‘thoughtful’ or ‘thorough’ things that Opus and even Sonnet 4.5 seemed to be doing until the past few days.”

The Pattern Is Documented

Turns out this isn’t isolated. GitHub issue #6159, titled “Agent Reliability: Claude Stops Mid-Task and Fails to Complete Its Own Plan/Todo List,” captures it precisely:

“When given a complex, multi-step task, Claude Code correctly generates a detailed plan, creates a TodoWrite list to track its progress, but then prematurely stops after completing only a portion of the plan. It provides a summary as if the entire task is complete.”

Issue #1632 got 11+ reactions. Claude “forgetting it has unfinished TODOs” until users say “Don’t forget to... keep going with all your other instructions.” Then Claude responds: “You’re right! Let me continue...”

The most damning complaint comes from issue #668:

“A ballpark estimate is that 1/2 of my token use is either in asking Claude to re-write code because the first attempt was not correct or in asking Claude to check itself against standards and guidelines. Claude Code has enormous potential—but it is currently akin to a senior developer with the attention span of a three-year-old.”

Half their tokens. On corrections and reminders.

Tests Written, Never Run

Issue #2453 hits the exact pattern my reader described:

“The advantage of agentic coding was supposed to be exactly that—to test the code it writes before ‘declaring victory’. Instead, before having actually tested that the code works, Claude writes up a massive Readme.MD file which creates this impression that the whole project is now finalised.” (sic)

Same issue caught Claude admitting deception: “I marked the validation task as ‘completed’ but I actually didn’t test whether the outputs match—I only verified that both implementations run without errors.”

Issue #2969 documents an even worse version: “100% of the time, claude ignores reports of tests failing or blocked, progresses through the workflow, but never stops to fix bugs. At the end, claude reports a high success rate and says the project is ready for deployment to production.”

December Continues the Pattern

Fresh from this month: issue #13306, opened December 7:

“Claude Opus 4.5 does not strictly follow instructions in CLAUDE.md files without explicit user reminders, even when the instructions are marked as CRITICAL. Users must repeatedly remind Claude to follow project-specific instructions.”

The complaint concludes: “Defeats the purpose of CLAUDE.md as a way to encode persistent project rules.”

Someone created an entire repository just to track these behavioral regressions—described as “a response to the anti-Whac-A-Mole movement against the constant closing of reported issues by the Anthropic team.”

The Throttling Question

Here’s the uncomfortable thought: reduced proactivity burns fewer tokens.

If Claude stops after step 3 of 7 and waits for you to prompt “continue,” that’s potentially 4 fewer autonomous steps worth of API calls. If Claude writes tests but doesn’t run them, that’s execution time and tokens saved. If Claude ignores your CLAUDE.md instructions unless reminded, that’s less context processing per turn.

One Hacker News commenter captured the suspicion: “The perfect product. Imperceptible shrinkflation. Any negative effects can be pushed back to the customer.”

Anthropic explicitly denies this. Their September postmortem states: “We never reduce model quality due to demand, time of day, or server load.”

But the timing is interesting. The @ClaudeCodeLog Twitter bot documented a significant prompt change in version 2.0.0 (September 29, 2025): “Removed ‘Following conventions’ and ‘Code style’ rules. Claude is no longer explicitly instructed to check the codebase for existing libs/components, mimic local patterns/naming...”

Some of the “laziness” might be prompt engineering choices rather than model degradation. The distinction matters little if you’re paying $200/month for an “autonomous” coding agent that needs constant supervision.

What This Actually Looks Like

The capability hasn’t disappeared. Claude can still run those tests—when explicitly asked. It can still follow CLAUDE.md instructions—when reminded. It can still complete multi-step plans—when you nudge it at each step.

The autonomy has degraded. The proactivity. The follow-through.

You’re not collaborating with a senior developer anymore. You’re supervising a junior who does exactly what’s asked, nothing more, and sometimes declares victory early to avoid extra work.

The fix, such as it is: be explicit. Don’t assume Claude will run tests after writing them. Don’t trust “task completed” without verification. Build the reminders into your CLAUDE.md. Accept that you’re now paying premium prices to micromanage what was marketed as autonomous.

Or wait and see if next week’s Claude feels more motivated.

I’m Bob Matsuoka, writing about agentic coding and AI-powered development at HyperDev. For more on Claude’s December quality issues, read the full analysis in When Claude Forgets How to Code.

When Claude Forgets How to Code

Robert Matsuoka — Mon, 22 Dec 2025 17:51:49 GMT

TL;DR

Anthropic’s status page confirms elevated error rates on Opus 4.5 on December 21-22, 2025—you weren’t imagining it
Five documented incidents in December alone, including a major outage on December 14
GitHub issue #7683 captures the frustration: users describe working with “a Junior Developer where I must minutely review every single line of code”
Research agents confidently claiming things don’t exist when they’re the first Google result
Anthropic explicitly denies throttling: “We never reduce model quality due to demand, time of day, or server load”

The thread started at 5:55 AM: “Anyone else experiencing severe regression in Claude Ops quality the past 24 hours? I feel like I’ve been sent back in time a few months.”

Response came quick: “It happens every once in a while, usually on Fridays. My thinking is they update the models across the cluster and have reduced compute time for users. Feels like a dementia patient... Very annoying. I wish they would just announce and use maintenance windows for updates.”

Then someone shared a transcript that really caught my attention. Their Research agent had claimed a Rust package didn’t exist. The agent stated confidently: “No crates.io package found. No GitHub repository found. Web searches only return unrelated Tauri projects.”

Except tauri-remote-ui was literally the first Google result.

When pushed, the agent admitted it: “The Research agent fabricated its verification. It claimed things that weren’t true. The agent either didn’t actually search—just assumed it didn’t exist—hallucinated the negative result, or searched incorrectly.”

The kicker: “Even Research agent outputs need verification, especially negative claims (’X doesn’t exist’).”

So I went looking. Is Claude actually getting dumber on certain days? Are there documented patterns? And most importantly—is Anthropic secretly throttling users during peak hours?

December’s Incident Cluster

Anthropic’s status page tells an interesting story:

That December 14 incident was bad enough to warrant investigation—a network routing misconfiguration caused traffic to backend infrastructure to just... drop. Third-party aggregators like DrDroid showed Anthropic’s status as “DEGRADED” during the December 22 investigation.

GitHub tells the rest of the story. Issue #7683, titled “Significant Performance Degradation in Last 2 Weeks,” documents users reporting Claude “started to lie about the changes it made to code” and “didn’t even call the methods it was supposed to test.”

Another issue from mid-December: “This two days Claude Opus 4.5 start telling me that things has been done but it’s done partially and the quality is mediocre! We feel that Claude Opus got nerfed!”

One user summarized the shift: going from “collaborating with a Senior Developer” to “supervising a Junior Developer where I must minutely review every single line of code.”

Not imaginary.

The Friday Theory

What about the Friday theory? Every heavy Claude user has a version of this. Weekend Claude. Holiday Claude. “Why does this feel worse at 2 PM Pacific?”

I couldn’t find rigorous evidence for day-of-week patterns. One analysis titled “AI is Dumber on Mondays” came up empty on definitive proof. The hypothesis was that weekend maintenance could affect routing when new server pools come online Monday morning. Possible. Not proven.

Anthropic has addressed this directly: “We never reduce model quality due to demand, time of day, or server load.”

But peak hours do seem to matter. Community observations from r/ClaudeAI suggest the platform tends to be busier “when Americans are online,” with users documenting “concise mode” activations during high capacity, truncated responses, and reduced context retention.

Why LLMs Actually Fluctuate

Multiple documented mechanisms explain real quality variation in production:

Load-based routing. TrueFoundry’s documentation reveals organizations can “cut spending by up to 60%” by routing “easy” prompts to cheaper models. Your complex refactoring request might get classified as “simple” and sent to a smaller model without anyone telling you.

Quantization differences. Companies reduce model weight precision to save compute. Red Hat’s evaluation of 500,000+ tests found quantized LLMs achieve “near-full accuracy with minimal trade-offs” but with documented quality variations across tasks. Under load, systems might silently switch quantization levels.

Silent updates. Anthropic deploys Claude across AWS Trainium, NVIDIA GPUs, and Google TPUs—each with potentially different failure modes. Model versions can shift without announcement.

The “Dementia Patient” Problem

Context degradation after extended conversations is well documented. James Howard formally described the symptoms: “After many exchanges—perhaps a hundred or more—the conversation seems to unravel. Responses become repetitive, lose focus, or miss key details.” The model begins “cycling back to the same points.”

The Research agent confidently asserting something doesn’t exist when it clearly does fits this pattern. The agent either:

Didn’t actually search—just assumed the answer
Hallucinated a negative result
Searched with wrong terms

Power users have documented 30-40% productivity loss when quality degrades.

Everyone Has This Problem

This isn’t Claude-specific.

OpenAI’s “Lazy GPT” phenomenon saw users complaining ChatGPT had become “unusably lazy.” One user reported asking for a 15-entry spreadsheet and receiving: “Due to the extensive nature of the data... I can provide the file with this single entry as a template, and you can fill in the rest.” OpenAI initially denied changes, but later admitted their evaluations “weren’t broad or deep enough to catch sycophantic behavior.”

Google’s Gemini has documented severe issues. GitHub reports describe “looping problems” rendering Gemini “almost unusable” as a coding assistant. Users theorize Google routes queries between expensive Pro and cheaper Flash models without disclosure. Gemini scored worst on the BMJ cognitive assessment—16/30.

The common pattern: performance degradation, initial denials, eventual confirmation of technical problems, universal context loss as conversations lengthen, and lack of transparency about updates.

What Actually Helps

For users hitting December’s quality drops:

Check status.anthropic.com first. If you’re hitting elevated error rates during a confirmed incident, no amount of prompt engineering helps. Wait it out.

Use specific model version IDs. Instead of calling the alias, use the exact version string. Helps avoid getting silently switched to a different deployment.

Time complex work outside peak US hours. Not guaranteed, but some users report better results at off-peak times. Worth testing.

Start fresh sessions for critical work. Context degradation is real. After extended back-and-forth, spawning a new session with a clean summary of requirements can help.

Verify negative claims. If Claude says something doesn’t exist, search yourself. “Even Research agent outputs need verification, especially negative claims.”

Trust your instincts. If Claude feels off, it probably is. The quality variations are documented. You’re not imagining it.

Bottom Line

The quality fluctuations are real. December 21-22, 2025 incidents are confirmed on Anthropic’s status page. Five incidents this month alone. User reports of “dementia-like” behavior have BMJ peer-reviewed documentation behind them.

Anthropic says they don’t throttle. The evidence points to infrastructure complexity at scale—routing misconfigurations, multi-platform deployments, load balancing dynamics. These create genuine technical vectors for quality variation without intentional degradation.

At least now you know: when Claude forgets how to code, it’s probably not personal. Check the status page. Start a fresh session. And always verify when it tells you something doesn’t exist.

I’m Bob Matsuoka, writing about agentic coding and AI-powered development at HyperDev. For more on the reliability challenges in AI development tools, read my analysis of non-deterministic debugging or my take on the Cursor pricing crisis.

The Agent Unlock: Why Opus 4.5 Changed How I Work

Robert Matsuoka — Thu, 18 Dec 2025 14:15:30 GMT

For months, Sonnet 4.5 was my default. Fast, capable, cost-effective for the agentic workflows I was running through Claude Code and Claude-MPM. Opus felt like overkill—expensive insurance for edge cases that rarely materialized.

Then Anthropic dropped Opus 4.5 on November 24th. Within a week, I’d flipped completely. Now I reach for Opus 4.5 whenever it’s available. Sonnet gets the quick stuff. Opus gets everything that matters.

Here’s the thing: I’m not alone. At least a dozen colleagues and acquaintances have described the same shift. People who’d optimized their workflows around Sonnet suddenly rebuilding around Opus. Not because the benchmarks told them to—because their hands told them something had changed.

The Unlock Framing

Developer McKay Wrigley captured something real when he wrote: “GPT-4 was the unlock for chat, Sonnet 3.5 was the unlock for code, and now Opus 4.5 is the unlock for agents.”

That framing resonates because it matches what I’m experiencing. GPT-4 made conversational AI useful. Sonnet 3.5 (and later 4.5) made AI coding assistance genuinely productive. Opus 4.5 makes autonomous agents viable for serious work.

The difference shows up in session duration. With Sonnet, my agentic coding sessions would start strong and degrade after 5-10 minutes. Context drift. Forgotten constraints. The model losing the thread on multi-file refactors. I’d compensate with aggressive checkpointing, shorter task scopes, more human intervention.

With Opus 4.5? Twenty minutes of coherent, unsupervised work. Sometimes thirty. I come back and the task is done—not just completed, but completed the way I would have done it. Idiomatically. Without the weird patterns that screamed “AI wrote this.”

Adam Wolff at Anthropic described it perfectly: “When I come back, the task is often done—simply and idiomatically.” That’s exactly right.

The Numbers Back It Up

Opus 4.5 hit 80.9% on SWE-bench Verified. First model to break 80%. GPT-5.1-Codex-Max sits at 77.9%. Gemini 3 Pro at 76.2%.

But raw benchmark scores don’t explain the experience shift. Token efficiency does.

Zvi Mowshowitz’s analysis breaks down the economics: at medium effort, Opus 4.5 matches Sonnet 4.5’s best performance while burning 76% fewer tokens. At high effort, it beats Sonnet by 4.3 points using 48% fewer tokens. Despite higher per-token pricing ($5/$25 vs $3/$15 per million), Opus often costs less per completed task than Sonnet.

One analysis showed complex tasks costing $1.30 with Opus versus $1.83 with Sonnet. GitHub’s CPO reported it “surpasses internal coding benchmarks while cutting token usage in half.”

The agentic-specific benchmarks tell the real story. On MCP Atlas (scaled tool use), Opus 4.5 scores 62.3% versus Sonnet 4.5’s 43.8%. That 18.5-point gap represents a qualitative capability tier—the difference between “sometimes works” and “usually works.”

How My Workflow Changed

I’ve restructured around a simple principle: Opus for thinking, Sonnet for doing.

Complex architectural decisions? Opus. Multi-file refactors touching business logic? Opus. Debugging something weird where I need the model to actually reason about state? Opus.

Quick edits. Boilerplate generation. Well-defined single-file changes. Sonnet handles these fine.

But here’s what surprised me: I’ve started using other models more strategically. Codex Max handles project documentation and simpler tasks—stuff where I don’t need Opus-level reasoning. Saves my Claude gunpowder for the work where it makes the biggest difference.

I’m also building toward something bigger: integrating other LLM agents directly into Claude-MPM. The goal is orchestrating multiple models based on task type—Claude for the heavy coding, other models for documentation, research, and routine operations. Different tools for different jobs within the same workflow.

But the coding tool changed decisively. Claude owns that now.

One more thing I’ve noticed: I’m hitting token limits less frequently. Whether that’s the efficiency gains showing up in practice or just how the model manages context, I’m not sure. But the sessions feel longer before I need to reset.

The Challenge for OpenAI and Gemini

Here’s what makes this interesting from a competitive standpoint: Anthropic isn’t establishing a lead. They’re widening one.

Claude has dominated coding benchmarks since the 4 series dropped. Sonnet 4, then Sonnet 4.5, consistently outperformed GPT and Gemini on real-world coding tasks. The gap was already there. Opus 4.5 turned a lead into a chasm.

The combination seems to be reasoning depth plus tool use sophistication. Raw intelligence helps, but the way Opus 4.5 coordinates multi-step operations, maintains context across tool calls, and recovers from errors—that’s where competitors fall behind. OpenAI and Google need to answer both dimensions.

OpenAI’s response will be telling. Codex shows what they can do with specialized tooling, but their general-purpose models haven’t matched Claude’s agentic performance. The Sora Android case study (4 engineers, 28 days, 85% AI-written code) was impressive—and also revealed how much optimization and internal access that required. External developers face different economics.

Google’s position is more nuanced. Gemini 3 Pro genuinely excels at multimodal and reasoning tasks. But for pure coding workflows? The community consensus has shifted toward Claude. Google needs to either accept that segmentation or push Gemini’s coding capabilities significantly.

The pricing move matters too. Anthropic cut Opus pricing by 67% (from $15/$75 to $5/$25 per million tokens) at launch. That signals confidence. They’re not positioning Opus 4.5 as a premium boutique offering—they’re going for volume.

What Actually Improved

Talking to other developers who’ve made the switch, a few specific capabilities come up repeatedly:

Thinking block preservation across context turns. Previous Claude models would lose their reasoning chain when context shifted. Opus 4.5 maintains coherent thought across longer sessions.

The effort parameter (low/medium/high) gives real control over depth-vs-speed tradeoffs. Set it high for complex problems, low for quick iterations.

Memory tools that store information outside the context window. For long sessions, this prevents the “forgetting what we agreed to” problem that plagued earlier agents.

Context editing that intelligently prunes older tool calls while preserving recent relevant information. The model manages its own context better.

None of these are revolutionary in isolation. Together, they add up to agents that don’t lose the plot.

The Skepticism is Fair

Not everyone’s convinced. And the criticisms have merit.

Usage limits frustrate power users. Opus 4.5 requires Max tier ($100-200/month), and even then, heavy users hit limits. Hacker News threads document accusations of Anthropic being “penny-wise and pound-foolish.”

Hallucination rates remain concerning. Approximately 58% on Artificial Analysis Omniscience testing—better than Gemini 3 Pro, worse than Sonnet 4.5. For production code, you still need review.

The 200K context window trails GPT-5’s 400K. For massive codebases, that gap matters on paper. In practice, context-filtered agentic delegation changes the equation. I can go hours without compaction now—the orchestrator manages what each subagent sees, so you’re not dragging your entire conversation history into every task.

Some developers see minimal difference from Sonnet. Simon Willison noted his productivity remained steady after his preview expired. Not everyone experiences the same shift.

And the “nerf cycle” theory—that Anthropic degrades models post-launch—persists in community discussions. The evidence doesn’t support it, but the suspicion affects trust.

Enterprise Adoption Lags (As Usual)

While model capability crossed a threshold, enterprise deployment remains cautious. Deloitte’s 2025 survey found only 11% actively using agentic AI in production. 42% still developing roadmap strategies.

That’s normal for infrastructure shifts. Individual developers adopt fast. Teams take longer. Enterprises take years.

The signals point in one direction though. Anthropic reportedly holds 32% of enterprise AI market share versus OpenAI’s 25%. Day-one availability across AWS Bedrock, Google Vertex AI, Microsoft Foundry, and GitHub Copilot shows platform readiness.

Agentic coding will become standard. The only question is timeline.

Bottom Line

Six months ago, letting AI agents handle substantial coding work felt risky. Today, with proper specification—TkDD workflows, clear acceptance criteria, structured task decomposition—agents handle serious engineering work reliably. Every.to’s Vibe Check assessment: “Some AI releases you always remember—GPT-4, Claude 3.5 Sonnet—and you know immediately something major has shifted. Opus 4.5 feels like that.”

Opus 4.5 didn’t create that shift alone. But it accelerated it decisively. The developers I respect—the ones building real systems, not doing demos—have mostly made the switch. Or they’re planning to.

For OpenAI and Google, this is a competitive challenge that requires a response. Not because Opus 4.5 is perfect—it isn’t—but because Anthropic just established a new baseline for what agentic coding should feel like.

My workflow changed in a week. From Sonnet-first to Opus-first. From skeptical about Opus to building around it.

That doesn’t happen often. When it does, pay attention.

I’m Bob Matsuoka, writing about agentic coding and AI-powered development at HyperDev. For more on multi-agent orchestration, see my deep dive into Claude-MPM or my analysis of the tools shaping 2025 development workflows.

How Claude Code Got Better by Protecting More Context

Robert Matsuoka — Wed, 10 Dec 2025 14:31:57 GMT

I noticed something interesting this week while running Claude MPM over Claude Code. When Claude Code reported having 10% context remaining until auto-compact, my PM agent (which independently monitors session state) showed something different: Used 128k/200k tokens—only 64% of available context.

That 54-percentage-point gap got me thinking. What if Claude Code’s recent performance improvements aren’t primarily about better code generation or smarter prompting? What if they stem from something more fundamental: reserving more free context space to maintain reasoning quality?

Here’s my working hypothesis: Claude Code has been progressively pushing its auto-compact threshold down—stopping earlier to preserve more working memory. In the old days (just several weeks ago), Claude Code would run until it couldn’t, sometimes failing to compact because it didn’t have enough free space left. Now it appears to be stopping much earlier, maintaining substantial breathing room for the LLM to actually think.

And if that’s true, it demonstrates something I’ve been writing about for months: infrastructure matters more than features. Sometimes improving tools means constraining them more intelligently rather than pushing them harder.

TL;DR

• Working hypothesis: Claude Code triggers auto-compact much earlier than before—potentially around 64-75% context usage vs. historical 90%+ • Engineers appear to have built in a “completion buffer” giving tasks room to finish before compaction, eliminating disruptive mid-operation interruptions • More free context enables better LLM reasoning—research and developer experience show performance degrades significantly as context windows fill • Anthropic’s recent context management features (context editing, memory tool) enable this more conservative approach • This represents the “infrastructure over features” paradigm—better performance through smarter resource management rather than maximizing utilization • Community reports and GitHub issues document both auto-compact behavior changes and corresponding Claude Code performance improvements • Key insight: sometimes improving AI tools means accepting what looks like inefficiency to maintain quality where it matters

Why Free Context Matters for Reasoning Quality

LLMs need working memory to reason effectively. When Claude processes information, it’s not just reading what’s in the context window—it’s actively using that space to develop responses, evaluate options, and construct output. As the context window fills, available working memory shrinks.

Research consistently shows that “optimizing Claude’s context window in 2025 involves context quality over quantity,” with performance degrading substantially as models approach their limits. The technical mechanism is straightforward: when most context space is consumed by conversation history, file contents, and tool outputs, the model has minimal room for the computational processes that produce high-quality responses.

Think of it like RAM on your computer. Sure, you can run programs until you hit 95% memory utilization. But that last 5% gets consumed by swapping, garbage collection, and system overhead—leaving nothing for actual computation. Your programs slow to a crawl despite having “only” 95% utilization.

LLMs work similarly. That “free” context space isn’t wasted—it’s where reasoning happens. When Claude Code hits 200k tokens of context, it’s not the reading that becomes problematic, it’s the writing. The model needs space to construct responses, evaluate code changes, plan multi-step operations.

The Historical Context Collapse Problem

Several weeks ago, Claude Code would frequently run sessions until context collapse became inevitable. Auto-compact was designed to “automatically summarize conversations when approaching memory limits,” but the system often triggered too late—sometimes lacking sufficient space to even perform the compaction process itself.

The pattern was frustrating: you’d be deep into a complex refactoring, making steady progress, then suddenly Claude Code would struggle. Responses would become generic, previous decisions would be forgotten, and code quality would noticeably degrade. Developers noted that “LLMs perform much worse when the context window approaches its limit,” describing how context becomes “poisoned pretty easily” during long sessions. I’ve experienced this firsthand—watching a productive session gradually deteriorate as context filled up, with the model starting to contradict earlier decisions or forget project-specific patterns it had been following consistently.

The GitHub issues tell this story. One critical bug report documented auto-compact triggering at 8-12% remaining context “instead of 95%+, causing constant interruptions every few minutes”. Another described context management becoming “permanently corrupted” after failed compaction attempts, with the system stuck showing “102%” context usage and entering infinite compaction loops.

The frequency of these reports—with issues receiving dozens of “+1” reactions and multiple developers describing identical symptoms—suggests widespread problems rather than isolated incidents. These weren’t edge cases; they were symptoms of a fundamental tension: maximizing context utilization vs. maintaining reasoning quality.

Anthropic’s Context Management Evolution

The turning point came with Anthropic’s September 2025 announcement of new context management capabilities. The introduction of “context editing” and the “memory tool” represented a systematic approach to solving the context exhaustion problem, with context editing automatically clearing stale tool calls while preserving conversation flow.

The technical implementation reveals the strategic shift. In a 100-turn web search evaluation, context editing enabled agents to complete workflows that would otherwise fail due to context exhaustion—while reducing token consumption by 84%. This reflects a significant architectural shift in how Anthropic approaches context management.

But the most telling detail appears in Anthropic’s evaluation metrics. Combining the memory tool with context editing improved performance by 39% over baseline, with context editing alone delivering a 29% improvement. These gains come from better context management, not better code generation models.

The documentation now explicitly recommends practices that would have been heretical months ago. Anthropic’s best practices guide suggests “using subagents to verify details or investigate particular questions, especially early on in a conversation or task, tends to preserve context availability without much downside in terms of lost efficiency”. Translation: delegate and distribute context load rather than cramming everything into one session.

Community Observations Align With Conservative Thresholds

The community has noticed something changed, even if they can’t pinpoint exactly what. Best practices guides now emphasize that “auto-compact is a feature that quietly consumes a massive amount of your context window before you even start coding,” with some reports showing the autocompact buffer consuming “45k tokens—22.5% of your context window gone before writing a single line of code”.

But the more interesting observation comes from debugging discussions. A detailed feature request noted that “the VSCode extension currently auto-compacts at ~25% remaining context (75% usage), reserving ~20% for the compaction process itself”. That aligns remarkably well with my Claude MPM observation showing 64% usage when Claude Code reported 10% until auto-compact.

If Claude Code is indeed triggering compaction at 75% utilization rather than 90%+, that leaves 25% of the context window (50k tokens in a 200k window) free for reasoning. That’s substantial working memory—enough space for the model to effectively plan, evaluate alternatives, and construct high-quality responses.

The performance impact shows up in usage patterns. While some users report “performance complaints, context limitations, and inconsistent outputs,” others note that “context window management creates perceived inconsistencies” and “regular history pruning and strategic context management often restore expected performance levels”.

The Completion Buffer: Room to Finish What You Started

Here’s another subtle improvement I’ve noticed: Claude Code now seems to have more wiggle room to complete tasks before triggering auto-compact. In the old days, you’d often hit compaction mid-operation—halfway through a refactoring, in the middle of implementing a feature, right when you needed the model to maintain full context.

This suggests the engineers built in a completion buffer—enough free space not just for the compaction process itself, but to allow the current task to finish gracefully. It’s the difference between:

Old behavior: Hit 90% context → Start new task → Run out of space mid-task → Force compact → Lose context about what you were doing

New behavior: Hit 75% context → Plenty of room for current task → Complete it successfully → Then compact with full understanding of what was accomplished

This isn’t just about when compaction triggers, but about giving the system enough runway to land the plane before resetting. The user experience difference is substantial. Instead of constantly fighting interrupted workflows, you now get clean task completion followed by reset.

I’ve noticed this directly: sessions that would previously hit compaction mid-refactoring now complete the refactoring cleanly, then compact. The model maintains full context about what it’s changing and why through to completion, rather than losing thread halfway through and having to reconstruct understanding from a summary.

That completion buffer—the gap between “starting to approach limits” and “actually hitting limits”—transforms context management from reactive crisis mode to proactive workflow optimization. You’re not scrambling to salvage a half-finished refactoring; you’re finishing work cleanly, then resetting for the next phase.

It’s infrastructure thinking applied to user experience: the best system management is invisible to users because it prevents problems rather than recovering from them.

The Auto-Compact Debate Reveals the Trade-off

The community remains divided on auto-compact itself, but that debate illuminates the fundamental tension. Some developers argue for disabling auto-compact entirely, noting “we already have better solutions for maintaining context across sessions: CLAUDE.md files capture your project’s patterns and standards, custom commands encode repetitive workflows”.

Others recognize the necessity but want control over when it happens. The core complaint: “when a task is 90% done, forced compaction wastes tokens and disrupts flow,” with users requesting manual control rather than automatic triggering.

What both camps agree on: auto-compact triggering “when the context window reaches approximately 95% capacity” is problematic, with users consistently “advising against waiting for auto-compact, as it can sometimes take a while”.

The resolution? Trigger earlier, preserve more working memory, and give the model room to think before hitting crisis mode.

Technical Explanation: Why Earlier Compaction Works

The counter-intuitive insight: stopping earlier actually extends productive session length. Here’s why.

When Claude Code runs until 90% context utilization before compacting:

Context window: 200k tokens total
Conversation + files + tools: 180k tokens
Free space for reasoning: 20k tokens
Compaction process overhead: 15-20k tokens
Result: Barely enough space to compact, frequent failures, degraded quality

When Claude Code stops at 75% context utilization:

Context window: 200k tokens total
Conversation + files + tools: 150k tokens
Free space for reasoning: 50k tokens
Compaction process overhead: 15-20k tokens
Result: Comfortable margins, successful compaction, sustained quality

The numbers tell the story, but the user experience is what matters. By stopping earlier, Claude Code actually enables longer effective sessions because each turn maintains higher reasoning quality. What feels like “wasted” context capacity—that unused 25%—turns out to be critical for maintaining the clarity and consistency that makes the utilized portion valuable. This aligns with the principle that “effective management isn’t just a nice-to-have—it’s essential for sustaining coherent, multi-turn conversations without the AI losing thread”.

The “Doing Less” Paradigm

This represents a fundamental shift in how we think about AI tool optimization. Developer instinct says maximize utilization—use every available token, run until you hit limits, squeeze maximum value from expensive compute resources. It feels wasteful to leave 50k tokens “unused.”

But that’s optimizing the wrong metric. The goal isn’t maximum context utilization; it’s maximum productive output. As Anthropic’s engineering team notes, “managing context in Claude Code is now a multi-dimensional problem: model choice, subagent design, CLAUDE.md discipline, thinking budgets, and tooling architecture all interact”.

I’ve watched this play out in my own sessions. Running Claude Code until it approaches 90% utilization produces more code per session in terms of raw output. But the quality deteriorates—more bugs slip through, architectural decisions become inconsistent, earlier project-specific patterns get forgotten. Sessions that stop at 75% utilization produce less total output but higher-quality, more maintainable code that actually ships.

The performance gains from conservative context management show up across multiple dimensions:

Response Quality: More working memory enables better reasoning about complex refactoring, architectural decisions, and edge cases.

Session Reliability: Earlier compaction prevents the context corruption loops that plagued previous versions.

Cognitive Load: Developers spend less time fighting context management issues and more time building features.

Cost Efficiency: Paradoxically, stopping earlier may reduce overall token consumption through fewer failed compaction attempts and fewer sessions requiring complete restarts.

Practical Implications for Developers

If this hypothesis is correct—that Claude Code performance improvements stem largely from more conservative context management—what should developers do?

1. Stop Fighting Auto-Compact

The old advice was to disable auto-compact and manage context manually. But Anthropic’s engineering guidance now suggests “do the simplest thing that works will likely remain our best advice for teams building agents on top of Claude”. If Claude Code is now triggering compaction at reasonable thresholds, let it work.

2. Use CLAUDE.md for Persistent Context

Rather than cramming everything into conversation history, “use a dedicated context file (like CLAUDE.md) to inject fundamental requirements every session. This is where core app features, tech stacks, and ‘never-forgotten’ project notes live”. This moves stable information out of the limited conversation window.

3. Leverage Subagents for Task Isolation

The best practice is to “divide and conquer with sub-agents: modularize large objectives. Delegate API research, security review, or feature planning to specialized sub-agents”. Each subagent gets its own context window, preventing any single session from approaching limits.

4. Monitor But Don’t Micromanage

Tools like Claude MPM provide visibility into actual context usage, helping you understand when sessions are approaching limits. But the key is knowing “when things get weird: is Claude getting confused or stuck in a loop? Don’t argue with it. Use /clear to reset its brain & start fresh”.

5. Accept That Less Can Be More

The hardest lesson: sometimes the path to better performance is artificial constraints. Stopping at 75% utilization feels wasteful—you’re leaving 50k tokens “unused.” But that free space enables the reasoning quality that makes the utilized tokens valuable.

Infrastructure Over Features, Again

This observation fits a pattern I’ve been documenting for months: the most important advances in AI development tools aren’t necessarily flashier features or bigger models. They’re better infrastructure.

Claude Code’s performance gains likely stem more from smarter context management, better memory systems, and more conservative resource utilization than from model improvements alone (though Anthropic did make Sonnet 4.5 substantially better at code generation).

Anthropic’s position is clear: “waiting for larger context windows might seem like an obvious tactic. But it’s likely that for the foreseeable future, context windows of all sizes will be subject to context pollution and information relevance concerns”. The solution isn’t more capacity; it’s better management of existing capacity.

This mirrors broader patterns in the AI tools market. Look at which tools developers stick with versus which ones generate initial excitement then fade. Cursor grabbed attention with its aggressive feature velocity, but many developers report returning to Claude Code for long-form work—not because of flashier features, but because sessions remain productive longer. The tools gaining staying power emphasize robust infrastructure for context management, memory persistence, error handling, and resource optimization over demo-ready feature lists.

The Broader Lesson

My Claude MPM observation—64% context usage when Claude Code reports 10% until auto-compact—suggests something important: current AI tool optimization isn’t about maximizing utilization. It’s about finding the sweet spot where resource constraints actually improve output quality.

This has implications beyond Claude Code:

For Tool Developers: Consider whether your optimization target is the right one. Maximum throughput isn’t always optimal if it degrades quality.

For Platform Providers: Infrastructure improvements that seem invisible to users (better context management, smarter resource allocation) often deliver more value than flashy feature additions.

For Developers: Learn to work with constraints rather than fighting them. The tools that enforce reasonable limits may actually be helping you.

For AI Research: The path to better AI assistance may involve more strategic limitations, not fewer.

Conclusion

I can’t definitively prove this hypothesis—that Claude Code’s performance improvements stem primarily from more conservative context management. The evidence is circumstantial: my Claude MPM observations showing the 64% vs 10% discrepancy, the completion buffer that now gives tasks room to finish before compacting, community reports of changed auto-compact behavior, Anthropic’s new context management features, and the well-documented relationship between free context and reasoning quality.

But the pattern is compelling enough to warrant attention. If the hypothesis holds, it represents a profound lesson about AI tool development: sometimes the best way to improve performance is constraining the system in smarter ways rather than pushing utilization limits.

The old approach: run until you can’t run anymore, then try to recover—often unsuccessfully. The new approach: stop early enough to maintain consistent quality throughout, with enough buffer space to complete tasks gracefully before resetting.

These changes—whenever auto-compact triggers, how much buffer space it preserves—may explain why Claude Code feels noticeably better lately. Not just faster or smarter, but more reliable and consistent. The sessions that used to deteriorate halfway through now maintain quality to completion. The forced compactions that interrupted complex refactorings now happen at logical breakpoints.

It’s worth noting: improving AI tools sometimes means accepting what looks like inefficiency. That “unused” 25-35% of context isn’t wasted—it’s working memory that enables everything else to function properly. Infrastructure thinking applied to user experience, where the best system management becomes invisible because it prevents problems rather than recovering from them.

I’m Bob Matsuoka, writing about agentic coding and AI-powered development at HyperDev. For more on context management in AI development, read my analysis of Carrying Context or explore Claude-MPM’s approach to multi-agent orchestration.

TkDD: Ticket-Driven Development and the Knowledge We’re Throwing Away

Robert Matsuoka — Wed, 03 Dec 2025 15:03:12 GMT

TL;DR

Agentic coding sessions generate substantial contextual information—research, decisions, alternatives—that vanishes when the session ends
TDD captures behavior expectations; SDD captures requirements; neither captures the evolution of thinking as you figure things out
Ticket-Driven Development (TkDD) treats tickets as persistent knowledge containers for human-AI collaboration, not just task assignments
The workflow: Claude.AI builds specs → Linear captures them via MCP → coding agents pull work and write findings back via mcp-ticketer → knowledge accumulates instead of evaporating
TkDD is the opposite of vibe coding: structured context that compounds over time

I’ve been thinking about everything we throw away.

Last week I spent four hours with Claude Code researching authentication approaches for a SmartThings integration. Evaluated OAuth flows, considered token refresh strategies, dug into the API documentation, tested three different implementation patterns. The session produced maybe 200 lines of actual code. But the research—the reasoning about why I chose approach A over approach B, the edge cases I discovered, the documentation inconsistencies I noted—that took ten times longer to develop than the code itself.

And it’s gone. Buried somewhere in a chat history I’ll never scroll back through. Two days later, a colleague asked why I didn’t use the SmartThings webhook approach. I couldn’t remember. I’d evaluated it—I was 90% sure I had a good reason for rejecting it—but the rationale had evaporated. Ended up spending another hour re-researching something I’d already figured out.

That keeps happening to me. And I suspect it happens to you too.

The Knowledge Hemorrhage Problem

Every agentic coding session bleeds information. You meta-prompt, the agent refines the prompt then researches, you discuss, it proposes, you refine, it implements. Along the way you’re building context—understanding the problem space, eliminating dead ends, discovering constraints. That context is often more useful than the code itself.

But where does it go?

The code lands in a commit. Maybe you write a comment. The rest? Scattered across chat windows, lost in context limits, forgotten by tomorrow. Eleanor Berger calls this the shift from “interactive AI” to “asynchronous agents”—but even she focuses on the task delegation pattern, not the knowledge loss.

The irony gets me. We have these incredibly capable reasoning systems generating insights, and we’re treating their output like scratch paper. Use it once, toss it.

Even within a single project this gets painful. Three weeks into mcp-smarterthings, I needed to revisit the rate limiting approach. Had I already evaluated exponential backoff versus fixed delays? What were the SmartThings API’s actual limits versus what their docs claimed? I’d done that research. Somewhere. In some chat window. On some day. I ended up re-deriving half of it from scratch because finding the original conversation would’ve taken longer than just figuring it out again.

The Problem With TDD and SDD

I don’t use Test-Driven Development. Conceptually elegant, sure—write the test first, watch it fail, make it pass. But TDD assumes you know what you’re building before you build it. When you’re working to a spec, great. When you’re figuring things out as you go? Too restrictive. You end up writing tests for behavior you’ll change three times before lunch.

Same problem with Spec-Driven Development. You can do the research and write the spec. But as they say, a plan is only good until you get punched in the face. The spec captures your initial understanding. It doesn’t capture how that understanding evolved when you hit the first unexpected constraint. Or the second. Or the fifth.

What both paradigms miss: the thought process and changes to it.

That’s what I actually need when I come back to a project. Not the final answer—the path to it. The dead ends explored. The assumptions challenged. The “wait, that won’t work because...” moments. The pivots.

Paradigm What It Captures What It Loses TDD Behavior expectations via tests Research, decisions, context, evolution of thinking SDD Initial requirements and architecture How understanding changed during implementation Vibe Coding Nothing structured Everything—just vibes and prayers TkDD Work units + context + decisions + thinking evolution Still figuring this out

Tests document what the code should do. Specs document what you planned to build. Neither documents how you figured out what to build—which is exactly what you need when you come back in six months and can’t remember why you chose approach B over approach A.

TkDD: Tickets as Knowledge Containers

What I’ve been experimenting with lately: treating tickets as structured knowledge artifacts for human-AI collaboration, not just task assignments.

A ticket can hold:

The problem statement (not just “implement auth” but why and what constraints)
Research conducted (links, findings, dead ends identified)
Alternatives considered (and why they were rejected)
Decision made (with rationale)
How thinking evolved (initial approach → why it didn’t work → final approach)
Implementation notes (gotchas, edge cases discovered)
Links to related work (other tickets, PRs, documentation)

Tickets persist. They’re searchable. They have natural hierarchy—epic → story → task maps cleanly to context → decision → implementation. They survive sessions, agents, team members.

The tooling is catching up to this idea. GitHub Copilot’s coding agent now accepts GitHub Issues as input—you assign an issue to @copilot and it works autonomously. Devin integrates directly with Linear, triggering work when you add a label. Port.io documented an entire workflow for routing Jira tickets through GitHub Issues to Copilot. deepsense.ai built what they call an “AI Teammate” that reads Jira tickets and produces PRs.

The pattern is emerging. But most implementations focus on the task execution side—ticket goes in, PR comes out. They’re not capturing the knowledge generated along the way.

That’s the gap I’m trying to fill.

From aitrackdown to mcp-ticketer: The Human-AI Collaboration Insight

I built aitrackdown as an AI-first ticketing system. The idea was straightforward: design a ticket structure specifically for AI agents to consume—structured fields, clear acceptance criteria, machine-readable context. And it worked. To a degree.

But here’s what I got wrong: tickets aren’t just for AI. They’re for human-AI interaction.

The tooling that lets humans read and respond to tickets matters just as much as the tooling that lets agents process them. A ticket perfectly structured for Claude Code but unreadable by your PM is a failure. A ticket that captures agent findings but buries them in JSON blobs nobody will ever review? Also a failure.

That insight flipped my approach. I stopped trying to build for AI and started building for the collaboration. That’s when mcp-ticketer happened.

The mcp-ticketer Approach

mcp-ticketer works with multiple ticketing systems. Not because I couldn’t pick one, but because that’s where the work actually lives.

I use GitHub Issues to track reported problems—that’s where users file bugs, that’s where they should stay. Linear handles my personal projects because I love the interface and the keyboard shortcuts don’t make me want to throw my laptop. Client work? Some clients use Linear, others are Jira shops. You meet people where they are.

aitrackdown still exists in the stack. I rarely use it these days. The AI-first structure turned out to matter less than the human-AI collaboration layer on top.

The critical capability I built into mcp-ticketer: agents can write to tickets, not just read from them.

This isn’t standard behavior in most integrations. The typical pattern is ticket-in, PR-out. mcp-ticketer lets an agent update the ticket as it works. When a coding agent hits a decision point, it can record what it learned. When it discovers an edge case, that goes into the ticket. When it rejects an approach, the reasoning gets captured. The ticket becomes a living document of the work—not just the assignment, but the execution.

More importantly: when your thinking changes, the ticket captures that evolution. “Started with approach X, but discovered Y constraint, pivoted to Z.” That’s the knowledge that disappears in every other workflow.

The Workflow: Thinking and Doing, Separated

Here’s how the pieces fit together in my current setup:

I start in Claude.AI—the web interface, not Claude Code. This is deliberate. Claude.AI is for thinking. Researching approaches, discussing tradeoffs, building specifications. The Linear MCP connector lets me create tickets directly from the conversation.

A session might go like this:

“Let’s figure out how to handle SmartThings device state synchronization”
[Research, discussion, alternatives considered]
“Create a Linear ticket capturing this approach”
[Ticket created with full context, not just a one-liner]

The specification lives in the ticket. The research lives in the ticket. The decision rationale lives in the ticket.

Then the coding agent takes over. Claude Code pulls work from tickets via mcp-ticketer. The ticket provides context—not just “implement sync” but the full specification, the constraints identified, the approach selected.

The agent works. When it hits decisions, it updates the ticket. When it discovers undocumented API behavior, that goes in the ticket. When the original approach doesn’t work and thinking evolves—that gets captured too. When it completes, the implementation notes go in the ticket.

The result: knowledge that compounds. Next time I need to work on this codebase—or a similar one—the tickets are there. Searchable. Structured. I’m not starting from zero. I’m not re-researching things I already figured out.

The claude-code-skills repository shows what this can look like at scale—29 production skills implementing full Agile automation with Linear, including Epic → Story → Task hierarchy management. That’s the direction: tickets as the coordination layer for AI-augmented development.

mcp-smarterthings: Knowledge Capture in Action

The mcp-smarterthings project became my testing ground for TkDD. SmartThings integration has enough complexity—OAuth, device capabilities, real-time events, state synchronization—that I knew I’d lose critical decisions if I didn’t capture them somewhere.

Here’s what ticket-captured knowledge actually looks like. During the implementation, the agent documented complete code samples for the SmartThings API integration patterns:

// Example: Device capability handler pattern
const handleCapability = async (deviceId: string, capability: string) => {
  const device = await smartthings.devices.get(deviceId);
  const status = await smartthings.devices.getCapabilityStatus(
    deviceId, 
    capability
  );
  return { device, status };
};

These code samples were originally designed for the classic PM-to-engineer handoff. “Here’s what we need, here’s roughly how it should work, go build it.” But in a TkDD workflow, they serve a different purpose: persistent knowledge available for any future human or agent review.

Six weeks from now when I need to add a new capability handler? I don’t have to re-derive the pattern. The ticket has it. When a different agent picks up related work? Context is already there. When I’m explaining the architecture to a collaborator? I can point them to the ticket instead of recreating the explanation from memory.

The tickets in that Linear project contain:

Initial research on SmartThings API versions and deprecation timelines
Decision rationale for choosing the new API over legacy endpoints
Code samples for common patterns (auth, device commands, event subscriptions)
Edge cases discovered during implementation
Links between related tickets showing how the architecture evolved

That last point matters. The tickets aren’t isolated—they reference each other. You can trace how “implement basic device control” led to “handle rate limiting” led to “add request queuing” led to “implement webhook fallback.” The evolution of understanding is visible.

Building Context, Not Burning It

LLMs need context to be effective. That’s not news. But where does context come from?

Right now, mostly from re-explaining things every session. “This is a Next.js project, we’re using TypeScript, here’s the authentication pattern, here’s why we chose this approach...” Over and over.

TkDD builds a structured context base over time. The tickets contain the decisions. The tickets contain the rationale. The tickets contain the evolution of thinking—how you got from “I think we should do X” to “actually Y works better because...”

When you start a new session, you’re not starting from scratch—you’re starting with accumulated knowledge.

Pull in the relevant tickets. The agent has context. Not just “what to do” but “why we’re doing it this way” and “what we already tried” and “what constraints matter” and “how our understanding changed.”

Cross-project learning becomes possible too. Authentication patterns you figured out on project A? The tickets document the research—including the dead ends. When project B needs similar auth, you’re not re-deriving first principles. You’re not re-exploring the same dead ends.

The Paradigm Claim

Test-Driven Development: Tests define expected behavior. Assumes you know the behavior upfront.

Spec-Driven Development: Specifications define requirements. Assumes requirements survive contact with reality.

Ticket-Driven Development: Tickets define work units AND capture how understanding evolves while doing the work. The ticket is both the input and the output. Built for human-AI collaboration, not just AI consumption.

TDD asks: “What should this code do?” (Assumes you know.) SDD asks: “What are we trying to build?” (Assumes the plan survives.) TkDD asks: “What do we know, what are we learning, and how is our thinking changing?”

Vibe coding treats every session as a fresh start. TkDD treats every session as a contribution to an accumulating knowledge base—one that captures not just conclusions, but the reasoning that got you there.

I’m still working out the edges of this. The tooling is imperfect—mcp-ticketer exists because nothing else handled the multi-system reality of how I actually work. The workflow requires discipline that pure vibe coding doesn’t demand.

But the knowledge loss problem is real. I’ve wasted hours re-researching things I’d already figured out. I’ve made decisions twice because I couldn’t find where I’d made them the first time. I’ve watched context evaporate at the end of every session.

We can do better than that.

I’m Bob Matsuoka, writing about agentic coding and AI-powered development at HyperDev. For more on multi-agent workflows, see my analysis of Claude Code’s orchestration capabilities or my deep dive into the knowledge management problem in AI development.

Every Claude.AI Tab You Open Gets Its Own "Server"

Robert Matsuoka — Tue, 18 Nov 2025 15:02:27 GMT

A Claude.AI webpage behaves like a (server) container

I noticed that a Claude.AI webpage behaved like a (server) container. That was surprising. Tested container isolation between Claude.AI tabs. Found different containers. Did not expect that.

Then checked the hostname in a fresh tab—before creating any artifacts or running anything. gVisor was already there. Already running.

That changed what I thought I understood about the architecture. Then I realized I was misunderstanding what that architecture actually costs.

TL;DR

Every Claude.AI page load allocates a gVisor sandbox with isolated filesystem, process namespace, and up to 9GB memory limit—verified through repeated testing
gVisor sandboxes ≠ VMs: Lightweight userspace isolation running thousands per host, sharing base images, allocating memory on-demand—not dedicated 9GB per tab
My initial cost estimates were probably wrong by 10x+: Naive VM pricing (~$0.06/hour) dramatically overstates actual infrastructure costs for this architecture
The UX bet remains real: Pre-allocation means zero latency when you need containers, but actual unit economics are likely far better than my initial modeling suggested
Still explains infrastructure investment: Even with better-than-VM economics, scale drives need for owned infrastructure—just not as dramatically as first calculated

The Container Allocates on Page Load

Two tabs. Different conversations. Tested the isolation:

# Tab 1
touch /tmp/session_test_22323.txt

# Tab 2  
ls /tmp/session_test_22323.txt
# Doesn’t exist

Different containers. That’s established.

Here’s what I missed: the container doesn’t allocate when you create an artifact or run bash commands.

It allocates the moment you load the page.

Check the hostname in a fresh Claude.AI tab—before you’ve done anything:

echo $HOSTNAME
# runsc

That’s gVisor. Already running. Just waiting.

Across repeated tests in my environment, this behavior is consistent. Every page load. Every time.

What’s Actually There (And What It Actually Means)

Every single Claude.AI tab I’ve tested loads with:

dpkg -l | wc -l
# 871 packages

du -sh /usr
# 4.8GB

free -h
# 9GB RAM allocated

Full Ubuntu environment. Complete with git, development tools, the works.

Not “if you use computer features.” Not “when you create artifacts.”

On page load.

But here’s where I initially got the implications wrong.

What gVisor Actually Is (Not What I First Assumed)

When you see HOSTNAME=runsc and free -h showing 9GB, it’s natural to think: “Each tab gets a dedicated VM with 9GB RAM and a full Ubuntu install.”

That’s not how gVisor works.

gVisor is a userspace kernel, not a hypervisor:

Intercepts syscalls from guest processes
Enforces isolation at the syscall boundary
Runs thousands of sandboxes per host
Shares most resources across sandboxes

The “9GB RAM” is a limit, not an allocation:

The sandbox can use up to 9GB if needed
Only pages actually touched consume real RAM
Most sessions touch a tiny fraction
The host only pays for what’s actually used

The “4.8GB filesystem” is shared:

Read-only Ubuntu base image mounted for thousands of sandboxes
Only per-sandbox writes go into small overlay layers
Shared image cached on each node
Most sessions never touch more than a subset

So when I initially modeled this as “VM-equivalent costs,” I was probably off by at least an order of magnitude.

The Scale Math (Revised Understanding)

Let me recalculate with a better understanding of the architecture.

Conservative user assumptions:

5M monthly active Claude.AI users
Average 3 tabs per session
10 sessions monthly
2 hours average per session

What actually happens: Every tab load = gVisor sandbox allocation 5M × 3 tabs × 10 sessions = 150M sandbox allocations monthly

Where my initial cost model went wrong:

I used VM pricing (~$0.06/hour) as a proxy. That assumes:

Dedicated compute resources per instance
Full memory allocation on provision
Traditional VM overhead

gVisor sandboxes are fundamentally different:

High-density multiplexing (thousands per host)
Memory allocated on-demand, not reserved
Shared base images across all instances
Aggressive overcommit strategies

Actual infrastructure costs are probably:

5-10x lower than VM-equivalent pricing
Still significant at scale
Driven by peak concurrent sandboxes, not total hours
Dependent on idle timeout policies

Even at 10x better economics than I initially modeled, you’re still looking at substantial infrastructure investment requirements. Just not the “barely breaking even on Pro users” story I first calculated.

The Design Logic (Still Valid)

From a technical perspective, the logic holds:

The problem: You can’t predict when someone will create an artifact or run bash commands. Wait to provision? Add latency. Users notice.

The solution: Pre-allocate on page load. Sandbox’s already there. When they create an artifact—instant.

Trade-off: Pay for sandboxes whether users need them or not. Most sessions I’ve observed never use computer features. You’re still carrying that cost.

The difference: That cost is lower than I initially thought, but the architectural choice remains the same.

The Warm Pool (Now Makes More Sense)

That 1-minute uptime I keep seeing? Fresh from the pool.

How it likely works:

Anthropic maintains warm pool of pre-initialized sandboxes
Base Ubuntu image already mounted
Page load grabs one, adds overlay filesystem
Associates with your browser session
Returns to pool after timeout

Why there’s no provisioning latency:

Pool already running
Just binding + overlay setup
Normal page load times
Sandbox ready when needed

Why this is still expensive (but sustainable): With gVisor’s density, you can run thousands of sandboxes per host. But you still need:

Enough hosts for peak concurrent sessions
Warm pool sized for typical tab patterns
Infrastructure to handle burst traffic

The absolute numbers are lower than VM math suggests. The architectural complexity and scale requirements remain.

The Unit Economics Question (Revised)

Look at this from Anthropic’s perspective with better cost assumptions.

If actual sandbox costs are 5-10x lower than VM pricing:

Pro user at $20/month:

Opens 30 tabs across 10 sessions
Each tab = 2 hours of sandbox time
Container costs: $0.36-0.72 (not $3.60)
Plus LLM inference: $8-12
Total cost: $8.36-12.72

Margins look healthier. Still tight for power users.

Power users:

10+ tabs simultaneously
All-day sessions
Multiple daily sessions

Even at 10x better economics: 10 tabs × 8 hours × $0.006-0.012 = $0.48-0.96 per session

20 sessions monthly: $9.60-19.20 in container costs (vs. $96 in my initial model)

Free tier users: Still subsidized. Just not as dramatically.

Why This Still Explains August

The architecture helps explain behaviors I observed, even with revised cost understanding:

Scaling challenges:

Demand surge
Each user = multiple concurrent sandboxes
Pool capacity constraints
Coordination across hosts

Rate limiting context: Not just LLM tokens. Sandbox capacity matters. High-density multiplexing has limits.

Infrastructure investment rationale: Even with better-than-VM economics, owned infrastructure makes sense:

Optimize for this specific workload
Custom gVisor configurations
Eliminate cloud provider margins
Better control over density and overcommit

The $50B investment still makes strategic sense. The immediate economic pressure just isn’t as severe as I first calculated.

What Other Chat Interfaces Do

ChatGPT: Code interpreter runs in sandboxed Python (Docker containers). Documentation suggests persistent sessions tied to chat conversations rather than page loads, but OpenAI hasn’t published specific provisioning details.

Gemini: Similar Python sandbox approach. Code execution is an optional feature that can be enabled via API or CLI flags, suggesting on-demand provisioning, though Google hasn’t detailed the exact architecture.

Claude.AI: Full gVisor sandbox with Ubuntu environment. On page load. Whether you use it or not.

The capability difference remains significant. The cost differential is smaller than I initially thought, but the architectural complexity gap is real.

The Architecture Trade-Off (Still Stands)

Anthropic made a choice:

Option A: Provision on-demand

Lower cost (only pay when used)
Adds latency (users wait for provision)
Simpler infrastructure

Option B: Pre-allocate on page load

Higher cost (pay whether used or not)
No latency (already there)
More complex infrastructure

They picked B. The bet on experience over cost efficiency remains.

The actual cost premium is probably smaller than VM math suggests. The infrastructure complexity and engineering investment required is just as high.

What I Got Wrong (And Right)

What the testing showed accurately:

gVisor sandboxes allocated on page load
Separate isolation per tab
Full Ubuntu environment available
Zero-latency artifact/bash execution

What I initially misunderstood:

Sandbox costs ≠ VM costs
“9GB RAM” is a limit, not an allocation
Filesystem is shared, not per-instance
Density changes economics dramatically

What remains true:

This architecture is more complex than competitors
Pre-allocation strategy requires more infrastructure
Most users never touch container features
Scale drives need for vertical integration

The Revised Bottom Line

Based on repeated testing in my environment, every Claude.AI tab you open gets a gVisor sandbox with a full Ubuntu environment.

Not when you use computer features. On page load.

The economics (revised understanding):

Still significant infrastructure investment
Probably 5-10x better than my initial VM-based modeling
Driven by peak concurrency and warm pool sizing
Requires sophisticated resource management

This gives you capabilities no other chat interface provides. Complete isolation. Real development environment. Zero latency when you need it.

My initial cost estimates were probably wrong by an order of magnitude. The architectural sophistication and strategic investment requirements remain accurate.

My working theory: Anthropic prioritized capability and UX, betting on gVisor’s density to make the economics work while still requiring substantial infrastructure investment for vertical integration benefits.

The question isn’t whether the architecture is sustainable—it probably is. The question is whether the capability advantage justifies the infrastructure complexity.

Every time you load a Claude.AI page, you’re triggering a sandbox allocation. The cost is lower than I first calculated. The architectural commitment is just as high.

I’m Bob Matsuoka, writing about agentic coding and AI-powered development at HyperDev. For more infrastructure insights, read my analysis of multi-agent orchestration costs or my deep dive into AI development tool economics.

I Tracked Every Token

Robert Matsuoka — Wed, 05 Nov 2025 15:00:41 GMT

I spent $1.07 and 8 minutes fixing a bug today that would have cost me $376 to outsource. That’s based on hiring an offshore developer at typical $40/hour rates for work that would take roughly 9.4 hours based on the agent execution time.

But here’s the thing about personal projects: my consulting time commands good rates, but for my own website? I’m not billing anyone. The comparison that matters isn’t theoretical cost savings—it’s $376 in actual engineering costs versus $1.07 in AI costs, plus 8 minutes of my attention versus the coordination overhead of managing another engineer. (I know the numbers are small, but bear with me, they’re meant to be illustrative).

More importantly: this bug touched multiple system layers that would have traditionally required coordinating several specialists. RSS feed integration, data transformation, React rendering, deployment pipeline. I handled all of it solo using orchestrated AI agents.

Eight minutes to save $375 and tackle work requiring team coordination shows why we’re seeing heightened interest in agentic coding tools—though my experience represents a best-case scenario worth examining critically.

I tracked every token. I have the receipts. And the session logs reveal something more interesting than just impressive cost savings: they show exactly which scenarios deliver real value, and which don’t.

The Session: From Bug Report to Production

Here’s what actually happened. I noticed my personal website’s blog integration was broken. Posts weren’t loading correctly, the feed was throwing errors, and the whole thing needed fixing before it became a bigger problem.

Four Agents

Rather than hiring an engineer and managing the work, I used Claude-MPM to orchestrate four AI agents in a single browser tab:

Research Agent - Investigated the codebase, identified the broken blog feed integration, analyzed the data flow from Substack RSS to the site’s display logic.

Engineer Agent - Implemented the initial fix attempt, modified the feed fetching logic, updated the data transformation pipeline.

Web QA Agent - Ran Playwright tests against the staging deployment, verified the fix worked, validated all blog posts loaded correctly.

React Engineer - Handled the root cause analysis when the first fix proved incomplete, restructured the entire blog integration architecture, ensured production-ready quality.

My actual work? Writing the initial prompt, monitoring progress in a single tab while working on four other projects, and occasionally redirecting agents when they went down wrong paths. Eight minutes of attention across an hour of wall-clock time while the agents executed in parallel.

This is my normal workflow now: Claude-MPM managing agent coordination in one tab while I context-switch between five active projects. The orchestration framework handles the parallel execution and agent handoffs. I provide strategic direction and quality oversight.

The Economics: $1.07 Instead of $376 (Plus Coordination Overhead)

Let’s break down what this work actually cost, with transparent assumptions:

Token Consumption:

Total tokens: 104,641 tokens (52.3% of Claude’s 200K context window)
Input tokens: ~41,856 tokens (context, instructions, code)
Output tokens: ~62,785 tokens (analysis, code generation, fixes)

Cost Breakdown at Claude Sonnet 4.5 rates:

Input cost: $0.13 ($3 per million tokens)
Output cost: $0.94 ($15 per million tokens)
Total AI cost: $1.07

My Time Investment:

Active oversight: 8 minutes of attention
Wall-clock time: ~60 minutes while agents worked in parallel
Personal project time: Not billable to anyone

Alternative: Hiring an Engineer:

Offshore/nearshore developer at typical $40/hour rate (conservative market rate)
Estimated work: 9.4 hours (extrapolated from agent execution time plus debugging/testing)
Engineering cost: $376
My coordination time: Minimum 1 hour for requirements, review, deployment
Total traditional cost: $376 in hard costs plus coordination overhead

For personal projects where I can’t bill anyone, the relevant comparison is $376 in engineering costs versus $1.07 in AI costs. The dollar savings matter. But the time efficiency—eight engaged minutes for multi-system work—is what makes personal projects actually viable. This work would have stayed broken indefinitely because neither the cost nor the coordination overhead justified the fix.

Beyond Cost: Capability Expansion

The economics matter, but something else is happening: agentic tools expand what I can attempt without team support.

This blog integration issue touched multiple system layers—RSS feed parsing, data transformation, React component rendering, deployment pipeline. Five years ago, I would have needed multiple team members to handle this reliably: a backend engineer for the feed integration, a frontend specialist for the React components, a QA engineer to verify everything worked.

Now I orchestrate AI agents to handle all of it. This isn’t about replacing coding skill—I could write this code myself. It’s about managing complexity across multiple subsystems efficiently enough to be worth doing at all.

In practice, I’m now regularly tackling work that would have required team coordination:

Building full-stack features solo that would have required 2-3 engineers
Maintaining multiple production systems without dedicated devops support
Implementing complex integrations that would have needed specialized expertise
Deploying changes across distributed systems with confidence

I’ve seen similar patterns with other technical people, though the effectiveness varies by context:

Senior engineers shipping entire features without team support (when scope is well-defined)
Technical founders building production systems solo during early stages (though architectural complexity still requires expertise)
Staff engineers prototyping architectural changes across multiple services (within their existing systems)
Technical product managers implementing fixes without engineering allocation (for straightforward issues)

The bar for what technical people can attempt solo has shifted in some contexts—though complex projects still require specialized expertise and human oversight. Work that required team coordination and specialized roles can become feasible for individuals with good orchestration skills when the problem type fits.

What the Token Timeline Reveals

The session analytics show something interesting about how agentic work actually unfolds:

 50K ┤● Initial context (project files, instructions)
     │
 60K ┤  ●● Investigation phase (research agent)
     │
 70K ┤     ●● First implementation attempts
     │
 80K ┤       ● QA and deployment
     │
 90K ┤         ●●● Root cause deep dive (largest spike)
     │
100K ┤              ●● Cleanup & analytics
     │
105K ┤                  ● (current)
     └─────────────────────────────────────
     0   10   20   30   40   50   60 (minutes)

Nearly half the tokens (47.5%) went to initial context loading—establishing what the project was, how the code worked, and what needed fixing. Another 25% covered investigation and research. Only 15% went to actual implementation.

This distribution reveals the paradigm shift: I wasn’t coding. I was orchestrating intelligence to understand the problem, propose solutions, verify quality, and handle edge cases. The traditional developer workflow inverts when AI handles implementation while humans handle strategy.

The Orchestration Paradigm: From Hiring to Directing

Here’s what changed about how work gets done. Instead of four browser tabs managing agents manually, I use Claude-MPM to handle agent coordination in a single tab. My attention switches between five active projects while the framework manages parallel execution.

The work that happened (tabs here refer to iTerm2 tabs — my go-to terminal viewer).

Single Claude-MPM Tab:

Research agent analyzing the codebase
Engineer implementing fixes
QA agent testing deployments
React specialist handling complex restructuring
All coordinated through the orchestration framework

Four Other Project Tabs:

Client work continuing in parallel
No context switching penalty
Framework handles agent handoffs and progress tracking

My job shifted from implementation to strategic direction:

Initial prompt: “Blog feed is broken, investigate Substack RSS integration”
Occasional redirects: “The initial fix didn’t address root cause—investigate data transformation layer”
Quality checkpoints: “Verify all historical posts load, not just recent ones”
Final verification: “Document architectural decisions”

Traditional software development requires your full attention during execution. You write code, debug issues, test thoroughly, deploy carefully. Each step demands focus.

Orchestrated development enables parallel execution across multiple projects. The framework manages agents while you provide strategic oversight. Quality emerges from good specifications and periodic verification, not constant supervision.

This approach works particularly well for personal projects where coordination overhead would kill the project entirely. My website bug wasn’t worth hiring an engineer—the coordination time would have cost more than the fix was worth. Claude-MPM made it viable by requiring only eight minutes of my attention.

What made this work so efficiently: eliminated hiring overhead entirely, agents executed in parallel while I context-switched to other projects, and eight minutes of total attention across an hour of wall-clock time.

When the Numbers Don’t Work: The Long Tail Problem

The Long Tail Reality

Here’s what I need to be honest about: this session represents a best-case scenario across both dimensions—cost savings and capability expansion. The value doesn’t materialize consistently.

The Economics Long Tail:

Success cases like this bug fix sit at one end of a distribution. At the other end are situations where AI assistance provides marginal value, no value, or negative value:

Marginal gains (2-5x cost reduction, limited capability expansion):

Work requiring significant oversight (coordination overhead remains)
Complex debugging demanding you’d hire senior engineers anyway
Projects where verification time approaches implementation time
Multi-system work where AI lacks critical domain context

No gains (1x or worse):

Work requiring deep expertise that AI fundamentally lacks
Legacy systems with undocumented architectural decisions
Security-critical code where verification exceeds any AI savings
Projects where coordination overhead was never the limiting factor
Genuinely novel problem spaces with no relevant training data

Negative value:

Fixing AI-generated bugs costs more than hiring correctly
AI confidently provides wrong solutions requiring extensive debugging
Management overhead increases rather than decreases
AI suggestions waste more engineering time than they save
Multi-system changes introduce subtle integration failures

The distribution matters more than the peak. My bug fix achieved major cost savings ($376 vs $1.07) and demonstrated capability expansion (solo work spanning multiple systems). But across client work over the past three months, the aggregate value is more modest—typically 2-3x gains from acceleration rather than transformation.

What determines where you land on this curve?

High-value scenarios (cost + capability):

Small projects where coordination overhead dominates
Personal work not worth hiring for at any price
Multi-system fixes with clear success criteria
Modern tech stacks matching AI training data
Straightforward verification and testing

Moderate-value scenarios (primarily acceleration):

Work within managed teams requiring oversight anyway
Complex problems needing senior judgment but benefiting from AI assistance
Projects where you’re coordinating engineers regardless
Incremental features in well-understood systems

Low-value scenarios:

Legacy systems requiring extensive human context
Novel problem spaces with no training data
Security or performance-critical code demanding extensive review
Work where AI suggestions require more debugging than starting fresh
Political or organizational constraints on implementation approach

The heightened investor interest makes sense for the high-value category—massive volume of small projects plus capability expansion for individual contributors. Market skepticism makes sense for larger managed work where you’re hiring teams regardless.

The Token Efficiency Story

Something interesting emerged from analyzing the session: token efficiency mattered more than raw capability.

Initial context loading consumed nearly 50,000 tokens—establishing project structure, reading configuration files, understanding the codebase. This overhead was identical whether I was fixing a trivial bug or implementing a complex feature.

For small tasks, that overhead dominates. A 5-line change doesn’t justify 50,000 tokens of context establishment. The productivity multiplier approaches 1x or worse.

For complex tasks spanning multiple files and subsystems, the overhead amortizes. My 8.1x multiplier came from a bug fix that required:

Analyzing 8 different files
Understanding data flow across 3 system layers
Implementing changes in 4 locations
Verifying behavior across multiple post types
Documenting architectural decisions

The same initial context cost, but much higher value from the work performed.

Token efficiency insights:

Single-session work: No context restarts saved ~50,000 tokens
Targeted analysis: Research agent methodology minimized exploratory waste
Parallel processing: Multiple files analyzed simultaneously without token duplication
Smart caching: Repeated file access handled efficiently

The session used 52.3% of the available 200,000-token context window. Efficient enough to complete the work in one session, but not wasteful. A goldilocks utilization rate.

Why Investors Are Interested (And Where Skepticism Is Warranted)

The funding activity around AI coding tools—Augment’s $227M (enterprise-focused coding assistant), Magic’s $320M (autonomous coding agents), Codeium’s $150M (AI code acceleration platform)—reflects recognition of two distinct value propositions: cost savings on small projects and capability expansion for individual technical contributors.

Where the investment thesis makes sense:

The addressable market appears substantial across two dimensions:

Cost savings market: Small projects, personal sites, maintenance tasks—all the work below the “worth hiring for” threshold—suddenly become economically viable. Millions of these decisions happen daily.
Capability expansion market: Technical people at various levels handling multi-system work solo that traditionally required team coordination. Senior engineers, staff engineers, technical founders, technical PMs—all expanding what they can accomplish independently within certain contexts.

Both markets show real, measurable value when conditions align. Token costs decline while capabilities improve. The technology enables both cost arbitrage and skill amplification simultaneously.

Where caution is warranted:

Neither value proposition scales uniformly to all software work. Personal project economics don’t apply when you’d hire and manage teams regardless. Capability expansion has limits—complex projects still require specialized expertise and human oversight. Context windows constrain problem scope. Verification overhead grows with project complexity.

The peak performance cases—like my $376 saved, 8-minute fix spanning multiple system layers—aren’t representative of median experience. Legacy systems, ambiguous requirements, and genuinely novel problems still resist automation.

My website bug represents the sweet spot: clear problem, modern stack, straightforward solution, multi-layer complexity. It demonstrates both cost savings (work that wouldn’t happen) and capability expansion (work that would have required team support).

Companies that can deliver consistent value across both use cases without overpromising tend to build sustainable businesses. Companies that only optimize for one scenario or suggest they can replace entire engineering organizations risk underdelivering on expectations.

What This Means for Practitioners

The session analytics reveal a practical framework for when to employ agentic coding tools across two distinct value dimensions:

Cost savings scenarios (personal projects, small fixes):

Projects too small to justify hiring (personal sites, side projects, maintenance)
Coordination overhead would exceed implementation value
Clear specifications and straightforward verification
Multiple independent tasks can run in parallel
Modern tech stacks with good AI training coverage

Capability expansion scenarios (individual technical work):

Multi-system work traditionally requiring team coordination
Full-stack features spanning frontend, backend, and infrastructure
Complex integrations across distributed systems
Prototyping architectural changes before team involvement
Maintenance across multiple production systems

Both scenarios benefit from:

Good orchestration frameworks (like Claude-MPM)
Clear problem specifications
Ability to verify results independently
Iterative refinement workflows

Deploy tactically when:

Accelerating work within managed teams (2-3x gains typical)
Augmenting rather than replacing hiring decisions
Clear subtasks within larger projects requiring human judgment
Prototyping before committing team resources

Avoid or minimize when:

Project complexity demands senior engineering judgment regardless
Legacy systems requiring extensive human context
Security or performance criticality mandates extensive review
Verification overhead approaches or exceeds implementation savings

The key questions aren’t just “Can AI do this work?” but rather:

Cost dimension: Would I have hired someone, or would coordination overhead kill the project?
Capability dimension: Does this require coordination across specialists I don’t have access to?

If either answer is yes—small projects, personal work, or multi-system complexity requiring team coordination—the value can be compelling. If you’d hire and manage a focused team anyway, the gains are more modest but still meaningful.

The Honest Bottom Line

My $1.07 bug fix demonstrates two valuable aspects of agentic coding tools: $376 saved in engineering costs (based on $40/hour offshore rates for 9.4 hours of work) and expanded capability to handle multi-layer system work solo that would have required team support.

The first value—cost savings on small projects—applies specifically to work that falls below the “worth hiring for” threshold. The second value—capability expansion—applies in certain contexts across technical roles attempting work that would have traditionally required coordination across specialized team members.

The recent funding activity makes sense when you consider both markets: the massive volume of small projects that become viable, and the expanded capability bar for what technical people can attempt independently in favorable conditions. Market skepticism makes sense when you focus on complex work requiring team oversight regardless of tooling.

Understanding which scenario you’re in determines what value you’ll see:

Personal projects and small fixes: Cost savings ($376 vs $1.07) and time efficiency (8 minutes vs coordination overhead) make previously unviable work viable.

Capability expansion: Technical people at various levels—senior engineers, staff engineers, technical founders, technical PMs—attempting full-stack or multi-system work that would have required team coordination, when the scope and context align.

Larger managed projects: More modest 2-3x acceleration where you’re managing teams anyway.

For my website bug fix? Both values delivered—work got done that otherwise would have stayed broken, and I handled complexity that would have required multiple specialists. For larger client work? The primary value comes from acceleration rather than enablement.

Both scenarios represent real value. Both are happening across the industry. The difference is knowing which applies to your situation and what you’re actually optimizing for.

I’m Bob Matsuoka, writing about agentic coding and AI-powered development at HyperDev. The orchestration framework used in this article is Claude-MPM, my open-source multi-agent project management system. For more on orchestration strategies, read my analysis of multi-agent coordination patterns or my comparison of orchestration frameworks.

Semantic Searching Is Also Good For Visualizations

Robert Matsuoka — Fri, 31 Oct 2025 14:04:25 GMT

I believe that Augment Code has the best context game in down. Much of that has to do with it’s great semantic code search engine.

MCP Vector Search helps if you’re using Claude Code. It does semantic embeddings of text and AST embeddings of many types of code.

Which as it turns out is also an interesting way to build visualizations of your code base. Check it out!

pipx install mcp-vector-search