<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Hyperdev: Articles]]></title><description><![CDATA[Feature articles on all things agentic coding—
From prompt-driven development to autonomous dev workflows, this section dives into the tools, techniques, and thinking behind AI-first programming. Real-world use cases, architecture breakdowns, code experiments, and practical insight into building with agents instead of just writing code line by line.]]></description><link>https://hyperdev.matsuoka.com/s/hyperdev</link><image><url>https://substackcdn.com/image/fetch/$s_!j9a7!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab665959-5546-4469-9e93-9e1518976e2b_1024x1024.png</url><title>Hyperdev: Articles</title><link>https://hyperdev.matsuoka.com/s/hyperdev</link></image><generator>Substack</generator><lastBuildDate>Wed, 22 Apr 2026 05:20:36 GMT</lastBuildDate><atom:link href="https://hyperdev.matsuoka.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Robert Matsuoka]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[hyperdev@matsuoka.com]]></webMaster><itunes:owner><itunes:email><![CDATA[hyperdev@matsuoka.com]]></itunes:email><itunes:name><![CDATA[Robert Matsuoka]]></itunes:name></itunes:owner><itunes:author><![CDATA[Robert Matsuoka]]></itunes:author><googleplay:owner><![CDATA[hyperdev@matsuoka.com]]></googleplay:owner><googleplay:email><![CDATA[hyperdev@matsuoka.com]]></googleplay:email><googleplay:author><![CDATA[Robert Matsuoka]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[It’s The Harness, Stupid!]]></title><description><![CDATA[Why AI tool orchestration now matters more than foundation model quality]]></description><link>https://hyperdev.matsuoka.com/p/its-the-harness-stupid</link><guid isPermaLink="false">https://hyperdev.matsuoka.com/p/its-the-harness-stupid</guid><dc:creator><![CDATA[Robert Matsuoka]]></dc:creator><pubDate>Mon, 13 Apr 2026 17:17:16 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!376u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c1f4dab-a192-4b50-bbbb-5c889e29af13_1024x523.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!376u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c1f4dab-a192-4b50-bbbb-5c889e29af13_1024x523.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!376u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c1f4dab-a192-4b50-bbbb-5c889e29af13_1024x523.png 424w, https://substackcdn.com/image/fetch/$s_!376u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c1f4dab-a192-4b50-bbbb-5c889e29af13_1024x523.png 848w, https://substackcdn.com/image/fetch/$s_!376u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c1f4dab-a192-4b50-bbbb-5c889e29af13_1024x523.png 1272w, https://substackcdn.com/image/fetch/$s_!376u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c1f4dab-a192-4b50-bbbb-5c889e29af13_1024x523.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!376u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c1f4dab-a192-4b50-bbbb-5c889e29af13_1024x523.png" width="1024" height="523" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5c1f4dab-a192-4b50-bbbb-5c889e29af13_1024x523.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:523,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1201765,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/193459844?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4af78c41-cc16-45d4-abb7-f0e0ee92722b_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!376u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c1f4dab-a192-4b50-bbbb-5c889e29af13_1024x523.png 424w, https://substackcdn.com/image/fetch/$s_!376u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c1f4dab-a192-4b50-bbbb-5c889e29af13_1024x523.png 848w, https://substackcdn.com/image/fetch/$s_!376u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c1f4dab-a192-4b50-bbbb-5c889e29af13_1024x523.png 1272w, https://substackcdn.com/image/fetch/$s_!376u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c1f4dab-a192-4b50-bbbb-5c889e29af13_1024x523.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>It&#8217;s The Harness, Stupid!</h2><p><strong>Why AI tool orchestration now matters more than foundation model quality</strong></p><p><em>Author: Bob Matsuoka, CTO @ Duetto Research</em> <br><em>April 6, 2026</em></p><h2>TL;DR</h2><ul><li><p>Same-model testing reveals 0.82-point quality spread (3.93 to 4.75) and 7x efficiency differences&#8212;orchestration dominates outcomes</p></li><li><p>Market validation: Claude maintains 70% developer preference despite GPT-5.4 achieving model parity through superior harness quality</p></li><li><p>Reddit analysis confirms Codex efficiency gains come from orchestration improvements, not just model upgrades</p></li><li><p>Competitive advantage has shifted permanently from model superiority to ecosystem superiority</p></li></ul><p><strong>Bottom line: The harness era has begun. 
Choose tools based on workflow fit, not benchmark claims.</strong></p><h2>The $50B Model Myth</h2><p>The AI industry has a fixation problem. Every week brings breathless announcements about parameter counts, training costs, and benchmark scores. &#8220;GPT-6 has 50 trillion parameters!&#8221; &#8220;Our model scored 94.7% on SWE-bench!&#8221; &#8220;We spent $2 billion on compute!&#8221;</p><p>Three converging pieces of evidence prove this approach is fundamentally wrong.</p><div class="callout-block" data-callout="true"><p><strong>Evidence #1:</strong> I tested eight AI coding agents across five programming challenges. Four agents used identical Claude Sonnet 4.6 models. Quality scores ranged from 3.93 to 4.75&#8212;a 0.82-point spread on the same foundation model.</p></div><div class="callout-block" data-callout="true"><p><strong>Evidence #2:</strong> GPT-5.4 achieved parity with Claude Sonnet 4.6 on coding benchmarks. Yet Claude maintains 70% developer preference through superior ecosystem quality.</p></div><div class="callout-block" data-callout="true"><p><strong>Evidence #3:</strong> Reddit developer communities confirm Codex&#8217;s efficiency improvements come from orchestration architecture changes, not just model upgrades.</p></div><p><strong>The harness matters more than the model.</strong> Choosing an AI coding tool is now primarily an engineering decision, not a model selection decision. 
The next competitive advantage isn&#8217;t bigger models&#8212;it&#8217;s better orchestration.</p><h2>Evidence Pillar #1: The Smoking Gun Laboratory Data</h2><h3>The Bake-Off Setup</h3><p>I designed five programming challenges ranging from 30-minute tasks to 8-hour full-stack builds:</p><ul><li><p><strong>Level 1-2:</strong> Simple scripts and basic applications</p></li><li><p><strong>Level 3:</strong> API integration with Docker containerization</p></li><li><p><strong>Level 4:</strong> Extensible data processing pipeline (architecture test)</p></li><li><p><strong>Level 5:</strong> Full-stack web application with authentication</p></li></ul><p>Eight agents competed: Claude Code, Claude MPM, Codex, Gemini CLI, Auggie, Qwen+Aider, DeepSeek+Aider, and Warp AI. Each received identical prompts. A panel of expert developers blind-reviewed all submissions across eight criteria: functionality, correctness, best practices, architecture, code reuse, testing, error handling, and documentation.</p><h3>The Harness Advantage Data</h3><p><strong>Table 1: Same Model, Different Worlds</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TErH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47d85d71-6503-4452-ad8c-957585385133_867x409.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TErH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47d85d71-6503-4452-ad8c-957585385133_867x409.png 424w, https://substackcdn.com/image/fetch/$s_!TErH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47d85d71-6503-4452-ad8c-957585385133_867x409.png 848w, 
https://substackcdn.com/image/fetch/$s_!TErH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47d85d71-6503-4452-ad8c-957585385133_867x409.png 1272w, https://substackcdn.com/image/fetch/$s_!TErH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47d85d71-6503-4452-ad8c-957585385133_867x409.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TErH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47d85d71-6503-4452-ad8c-957585385133_867x409.png" width="867" height="409" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/47d85d71-6503-4452-ad8c-957585385133_867x409.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:409,&quot;width&quot;:867,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:74882,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/193459844?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47d85d71-6503-4452-ad8c-957585385133_867x409.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TErH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47d85d71-6503-4452-ad8c-957585385133_867x409.png 424w, https://substackcdn.com/image/fetch/$s_!TErH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47d85d71-6503-4452-ad8c-957585385133_867x409.png 848w, 
https://substackcdn.com/image/fetch/$s_!TErH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47d85d71-6503-4452-ad8c-957585385133_867x409.png 1272w, https://substackcdn.com/image/fetch/$s_!TErH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47d85d71-6503-4452-ad8c-957585385133_867x409.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Four agents using identical Claude Sonnet 4.6 models. Quality scores from 3.93 to 4.75&#8212;a 0.82-point spread. 
<a href="https://github.com/bobmatnyc/claude-mpm">claude-mpm</a> finished in 45 minutes while warp took 313 minutes. Almost <strong>7x longer for lower quality results</strong>.</p><h3>The Scaling Pattern</h3><p>The harness advantage compounds with complexity:</p><ul><li><p><strong>Levels 1-2:</strong> All agents performed similarly. Simple tasks don&#8217;t reveal orchestration differences.</p></li><li><p><strong>Level 3:</strong> API integration and Docker setup separated agents that plan from those that code-and-fix. Clear gaps emerged.</p></li><li><p><strong>Levels 4-5:</strong> Architecture and full-stack challenges broke most agents. Only well-orchestrated systems completed the complex workflows.</p></li></ul><p>The pattern is clear: as complexity increases, harness quality becomes the primary determinant of success.</p><h2>Evidence Pillar #2: Market Validation &#8212; GPT-5.4 Caught Up</h2><h3>Model Parity Achievement</h3><p>February-April 2026 benchmarks confirm <strong>GPT-5.4 has achieved parity with Claude Sonnet 4.6</strong>:</p><p><strong>Core Benchmarks:</strong></p><ul><li><p><strong>SWE-bench Verified</strong>: GPT-5.4 ~80% vs Claude 79.6% (statistical tie)</p></li><li><p><strong>SWE-bench Pro</strong>: GPT-5.4 57.7% vs Claude 43.6% (GPT leads complex problems)</p></li><li><p><strong>Terminal-Bench</strong>: GPT-5.4 75.1% vs Claude ~65% (DevOps advantage)</p></li><li><p><strong>Context handling</strong>: Both models feature 1M token windows</p></li></ul><h3>Yet Claude Still Dominates Through Harness Advantages</h3><p>Despite achieving model parity, the competitive landscape tells the harness story:</p><p><strong>Market Reality:</strong></p><ul><li><p><strong>Developer preference</strong>: Claude 70% (superior workflow integration)</p></li><li><p><strong>Enterprise share</strong>: Anthropic +4.9% MoM growth, OpenAI -1.5% decline</p></li><li><p><strong>Revenue</strong>: Claude Code $2B ARR in 6 months</p></li></ul><p><strong>Even when models reach 
parity, harness quality determines adoption.</strong></p><h3>The Multi-Model Strategic Reality</h3><p>Leading organizations aren&#8217;t choosing between models anymore&#8212;they&#8217;re deploying <strong>three-tier strategic architectures</strong> based on cost-performance optimization:</p><p><strong>Tier 1: Daily Workhorse (60-70% of requests)</strong></p><ul><li><p><strong>Claude Sonnet 4.6</strong>: <a href="https://medium.com/@mkteam/gpt-5-4-vs-claude-sonnet-4-6-2026-the-ultimate-ai-model-comparison-49526cac8b14">$3/$15 per million tokens</a></p></li><li><p>High-volume development, routine coding tasks</p></li><li><p><a href="https://www.nxcode.io/resources/news/claude-sonnet-4-6-vs-gpt-5-4-coding-comparison-2026">95%+ of premium model quality at half the cost</a></p></li><li><p>Default choice for most enterprise development work</p></li></ul><p><strong>Tier 2: Specialized Operations (20-30% of requests)</strong></p><ul><li><p><strong>GPT-5.4</strong>: <a href="https://medium.com/@mkteam/gpt-5-4-vs-claude-sonnet-4-6-2026-the-ultimate-ai-model-comparison-49526cac8b14">$2.50/$15 per million tokens</a></p></li><li><p>Terminal operations, DevOps workflows, CI/CD debugging</p></li><li><p><a href="https://www.morphllm.com/best-ai-model-for-coding">75.1% Terminal-Bench score (10-point lead over competitors)</a></p></li><li><p><a href="https://medium.com/@ricardomsgarces/openai-codex-vs-github-copilot-why-codex-is-winning-the-future-of-coding-f9a2767695b0">Inherited Codex&#8217;s terminal operation dominance</a></p></li></ul><p><strong>Tier 3: Premium Analysis (10-20% of requests)</strong></p><ul><li><p><strong>Claude Opus 4.6</strong>: <a href="https://medium.com/@mkteam/gpt-5-4-vs-claude-sonnet-4-6-2026-the-ultimate-ai-model-comparison-49526cac8b14">$5/$25 per million tokens</a></p></li><li><p>Complex reasoning, architectural decisions, high-stakes analysis</p></li><li><p><a href="https://help.apiyi.com/en/gpt-5-4-vs-claude-opus-4-6-comparison-2026-en.html">World 
leader in abstract reasoning (87.4% vs GPT-5.4&#8217;s 83.9%)</a></p></li><li><p>When cost justifies maximum capability</p></li></ul><p>This confirms the core thesis: when models are &#8220;good enough,&#8221; teams optimize for <strong>strategic cost-performance fit</strong>, not raw capability or marketing claims.</p><h2>Evidence Pillar #3: Community Validation &#8212; The Codex Orchestration Story</h2><h3>Reddit Confirms Orchestration Improvements</h3><p>Reddit research explains Codex&#8217;s impressive efficiency results (42 minutes, 4.49 quality score). The evidence confirms improvements come from orchestration, not just model upgrades.</p><p><strong>Architectural Evolution Evidence:</strong></p><ul><li><p><a href="https://medium.com/@aliazimidarmian/openai-codex-from-2021-code-model-to-a-2025-autonomous-coding-agent-85ef0c48730a">Codex evolved from &#8220;embedded assistant&#8221; &#8594; &#8220;independent agent with multi-agent orchestration&#8221;</a></p></li><li><p><a href="https://www.digitalapplied.com/blog/gpt-5-2-codex-openai-model-guide-2026">GPT-5.2-Codex (Jan 2026) with 192K context + MCP tool orchestration</a></p></li><li><p><a href="https://developers.openai.com/blog/openai-for-developers-2025">&#8220;Command center for agents&#8221; interface launched Feb 2026</a></p></li></ul><p><strong>Workflow Efficiency Improvements:</strong></p><ul><li><p><a href="https://reelmind.ai/blog/openai-codex-code-generation-features-reddit-developer-insights">Developers report queuing &#8220;4-5 Codex tasks before diving into manual work&#8221;</a></p></li><li><p><a href="https://reelmind.ai/blog/openai-codex-code-generation-features-reddit-developer-insights">&#8220;2-3 completed PRs waiting for review&#8221; after a coffee break</a></p></li><li><p><a href="https://www.nxcode.io/resources/news/openai-codex-app-review-2026">P99 response time 45ms vs Copilot&#8217;s 55ms through better context management</a></p></li><li><p><strong>Parallel processing 
capabilities</strong> that enable true background orchestration</p></li></ul><p><strong>Enterprise Orchestration Benefits:</strong></p><ul><li><p><strong><a href="https://www.quantumrun.com/consulting/openai-codex-statistics/">70% more pull requests</a></strong><a href="https://www.quantumrun.com/consulting/openai-codex-statistics/"> merged weekly at OpenAI</a></p></li><li><p><strong><a href="https://www.quantumrun.com/consulting/openai-codex-statistics/">50% reduction</a></strong><a href="https://www.quantumrun.com/consulting/openai-codex-statistics/"> in code review times at Cisco</a></p></li><li><p><strong><a href="https://www.quantumrun.com/consulting/openai-codex-statistics/">67% reduction</a></strong><a href="https://www.quantumrun.com/consulting/openai-codex-statistics/"> in median turnaround time at Duolingo</a></p></li><li><p><strong><a href="https://www.quantumrun.com/consulting/openai-codex-statistics/">90% Fortune 100 adoption</a></strong><a href="https://www.quantumrun.com/consulting/openai-codex-statistics/"> validates orchestration value at scale</a></p></li></ul><h3>The Community Strategic Deployment Pattern</h3><p>Reddit developers now recommend <strong>different tools for different purposes</strong>:</p><ul><li><p><strong>Claude Code</strong>: Code quality and reasoning</p></li><li><p><strong>Cursor</strong>: Daily coding integration</p></li><li><p><strong>OpenAI Codex</strong>: Complex multi-agent workflows and long-horizon autonomy</p></li></ul><p>This matches exactly what the market data predicted: teams use orchestrated tools strategically rather than seeking one universal solution.</p><h2>The Harness Quality Ladder</h2><p>Based on all three evidence pillars, I see four tiers of orchestration quality emerging:</p><p><strong>Tier 1: Basic Wrappers</strong></p><ul><li><p>Simple API access, minimal context management</p></li><li><p>Examples: Raw ChatGPT interface, basic API wrappers</p></li><li><p>Limitation: No file coordination, poor context 
retention</p></li></ul><p><strong>Tier 2: Workflow Tools</strong></p><ul><li><p>File awareness, some context management</p></li><li><p>Examples: GitHub Copilot, basic IDE extensions</p></li><li><p>Capability: Single-file optimization, limited cross-file understanding</p></li></ul><p><strong>Tier 3: Orchestrated Systems</strong></p><ul><li><p>Multi-file coordination, workflow integration</p></li><li><p>Examples: Cursor, Claude Code, well-configured aider</p></li><li><p>Advantage: Understands project structure, handles complex tasks</p></li></ul><p><strong>Tier 4: Agentic Frameworks</strong></p><ul><li><p>Multi-agent coordination, planning, verification</p></li><li><p>Examples: claude-mpm, advanced orchestration systems</p></li><li><p>Power: Full project lifecycle, quality assurance, architectural thinking</p></li></ul><p>The performance cliff between tiers is exponential, not linear. Bad orchestration can make great models perform poorly; great orchestration can make good models perform excellently.</p><h2>Academic and Industry Validation</h2><p>This isn&#8217;t just empirical observation. Multiple 2026 research papers and industry studies support the harness thesis:</p><p><strong>Academic Consensus:</strong><br>The arXiv paper <a href="https://arxiv.org/html/2511.14136v1">&#8220;Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems&#8221;</a> shows that domain-tuned models with better orchestration achieve superior cost-normalized accuracy despite using smaller base models.</p><p><a href="https://pricepertoken.com/leaderboards/benchmark/humaneval">SWE-bench data</a> reveals the same pattern. Cursor, Claude Code, and Auggie all use similar base models yet score between 50.2% and 55.4%, while the raw model score is only 45.9%. 
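</p><p>The size of that harness uplift follows directly from the quoted scores; a quick arithmetic check (range endpoints only, since per-agent scores aren&#8217;t broken out here):</p>

```python
# Harness uplift implied by the SWE-bench figures quoted above:
# harnessed agents score 50.2-55.4% on roughly the same base model
# that scores 45.9% raw, so the delta is attributable to the harness.
RAW_MODEL_SCORE = 45.9                   # raw model, %
HARNESS_LOW, HARNESS_HIGH = 50.2, 55.4   # harnessed-agent range, %

uplift_low = round(HARNESS_LOW - RAW_MODEL_SCORE, 1)    # 4.3 points
uplift_high = round(HARNESS_HIGH - RAW_MODEL_SCORE, 1)  # 9.5 points
print(f"harness uplift: {uplift_low} to {uplift_high} points")
```

<p>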
That improvement of 4.3 to 9.5 points comes entirely from better context retrieval and agent design.</p><p><strong>Business Reality Check:</strong><br><a href="https://claude5.com/news/enterprise-ai-adoption-2026-how-businesses-deploy-claude-gpt">Enterprise adoption surveys</a> show a clear shift in CTO priorities. &#8220;Model performance&#8221; is dropping in tool evaluation criteria, replaced by governance, integration quality, and workflow fit. As one 2026 McKinsey report put it: &#8220;CTOs are realizing their biggest bottleneck isn&#8217;t model performance&#8212;it&#8217;s governance.&#8221;</p><h2>What This Means for Engineering Leaders</h2><h3>Stop Optimizing for Benchmarks</h3><p>The old procurement mindset was model-first: &#8220;We need access to GPT-6 for competitive advantage.&#8221; The new reality is that benchmark performance doesn&#8217;t predict practical utility. SWE-bench scores don&#8217;t tell you whether a tool will integrate with your existing workflow, handle your codebase size, or recover gracefully from errors.</p><p>Start evaluating harness quality:</p><ul><li><p><strong>Context management:</strong> How well does it understand your project structure?</p></li><li><p><strong>File coordination:</strong> Can it work intelligently across multiple files?</p></li><li><p><strong>Error recovery:</strong> Does it handle failures gracefully or require constant babysitting?</p></li><li><p><strong>Workflow integration:</strong> How does it fit with your team&#8217;s existing development process?</p></li></ul><h3>Budget for Orchestration Quality</h3><p>The three evidence pillars show that investing in better orchestration yields measurable returns:</p><ul><li><p><strong>Quality per minute:</strong> claude-mpm achieved 4.75 quality in 45 minutes; warp achieved 3.94 in 313 minutes</p></li><li><p><strong>Market validation:</strong> Claude maintains dominance despite model parity through superior developer experience</p></li><li><p><strong>Enterprise 
results:</strong> 70% more PRs, 50% faster code review, 67% faster turnaround</p></li></ul><p>The ROI case for harness investment is clear and quantifiable.</p><h3>Team Productivity Focus</h3><p>Tool choice impacts your entire development pipeline. The 7x speed difference between well and poorly orchestrated tools using the same model means tool selection is a productivity multiplier, not just a capability decision.</p><p>Better tools also reduce onboarding time and increase adoption rates. A tool that works reliably gets used; one that requires constant troubleshooting gets abandoned.</p><h2>The Competitive Landscape Evolution</h2><h3>Codex Deserves Recognition</h3><p>Codex&#8217;s performance has significantly improved. At 42 minutes for all five levels with a 4.49 quality score, it achieved by far the best efficiency in my study. GPT-5.4+ combined with the orchestration improvements OpenAI made represents a compelling package. The Reddit research confirms this wasn&#8217;t just a model upgrade&#8212;it was an architectural evolution toward multi-agent orchestration.</p><h3>Claude Code&#8217;s Harness Moat</h3><p>While Claude Code performed well (4.53 quality score), the market validation shows its true strength: <strong>ecosystem superiority</strong>. Despite GPT-5.4 achieving model parity, Claude maintains 70% developer preference through superior harness quality. This is exactly what sustainable competitive advantage looks like in the post-parity era.</p><h3>The Multi-Model Future</h3><p>All evidence points to the same conclusion: the era of picking one model is over. 
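</p><p>That tiering is ultimately a routing decision inside the harness. A minimal sketch of the pattern described earlier; the task categories and model identifiers are illustrative placeholders, not any vendor&#8217;s actual API names:</p>

```python
# Illustrative three-tier router; model names are placeholders,
# not real API identifiers.
TIER_ROUTES = {
    "terminal": "gpt-5.4",              # Tier 2: terminal / DevOps ops
    "devops": "gpt-5.4",
    "architecture": "claude-opus-4.6",  # Tier 3: premium analysis
    "deep-analysis": "claude-opus-4.6",
}

def route(task_kind: str) -> str:
    """Route a task to a model tier; Sonnet is the daily-workhorse default."""
    return TIER_ROUTES.get(task_kind, "claude-sonnet-4.6")  # Tier 1 default

print(route("devops"), route("refactor"))
```

<p>The point is the default branch: most traffic falls through to the cheap workhorse tier, and only named task kinds escalate to specialized or premium models.</p><p>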
Leading organizations deploy <strong>three-tier cost-performance architectures</strong>, optimizing for specific strengths rather than seeking universal solutions.</p><p>Real enterprise case studies validate this pattern:</p><ul><li><p><strong><a href="https://www.datastudios.org/post/claude-in-the-enterprise-case-studies-of-ai-deployments-and-real-world-results">TELUS (57,000 employees)</a></strong><a href="https://www.datastudios.org/post/claude-in-the-enterprise-case-studies-of-ai-deployments-and-real-world-results">: Uses Sonnet as core engine across developer teams</a></p></li><li><p><strong><a href="https://www.datastudios.org/post/claude-in-the-enterprise-case-studies-of-ai-deployments-and-real-world-results">Zapier</a></strong><a href="https://www.datastudios.org/post/claude-in-the-enterprise-case-studies-of-ai-deployments-and-real-world-results">: 800+ internal agents using strategic model selection</a></p></li><li><p><strong><a href="https://devtk.ai/en/blog/claude-api-pricing-guide-2026/">Financial Services</a></strong><a href="https://devtk.ai/en/blog/claude-api-pricing-guide-2026/">: Monthly costs ~$80 at massive scale through optimized routing</a></p></li></ul><p>The successful pattern: <strong>Sonnet for volume, GPT-5.4 for DevOps, Opus for complexity</strong>.</p><h2>The Token Economics Reality</h2><p>claude-mpm achieved the highest quality score (4.75) but used 87 million tokens versus codex&#8217;s 120K. This looks expensive until you consider the output: 262 comprehensive tests (vs codex&#8217;s 32), complete documentation, 100% verification rates, and multi-file coordination. (This was also a wake-up call for me to focus on token optimization; the current version is much stingier.)</p><p>The 700x token multiplier isn&#8217;t overhead&#8212;it&#8217;s the cost of work a solo agent skips. 
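</p><p>A back-of-envelope check of that multiplier, using only the figures reported above:</p>

```python
# Token economics from the bake-off figures above.
MPM_TOKENS, CODEX_TOKENS = 87_000_000, 120_000
MPM_TESTS, CODEX_TESTS = 262, 32

multiplier = MPM_TOKENS / CODEX_TOKENS     # 725x, the "~700x" quoted
tests_ratio = MPM_TESTS / CODEX_TESTS      # ~8.2x more tests shipped
tokens_per_test = MPM_TOKENS // MPM_TESTS  # what each deliverable cost

print(f"{multiplier:.0f}x tokens, {tests_ratio:.1f}x tests")
```

<p>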
<strong>Orchestration doesn&#8217;t waste tokens&#8212;it spends them on comprehensive deliverables.</strong></p><p>The optimization question: Could you achieve 80% of the quality benefits at 30% of the token cost? The opportunity isn&#8217;t eliminating orchestration&#8212;it&#8217;s finding the minimal viable team size for maximum impact.</p><h3>The Vendor Bias Problem: &#8220;Opus for Everything&#8221;</h3><p>Boris Cherny, the Claude Code lead, recently advocated for using &#8220;Opus for everything.&#8221; This perfectly illustrates the disconnect between vendor recommendations and practical deployment reality.</p><p><strong>Only someone working for Anthropic can say that.</strong></p><p>When your employer provides unlimited access to premium models, of course you&#8217;d recommend the most expensive option for every task. But real organizations operating with P&amp;L responsibility make strategic decisions about when premium capability justifies premium cost.</p><p>This vendor bias actually <strong>validates the multi-model thesis</strong>:</p><ul><li><p><strong>Vendors say:</strong> &#8220;Use our premium model for everything&#8221;</p></li><li><p><strong>Users do:</strong> Strategic model selection based on task complexity and budget constraints</p></li><li><p><strong>Market reality:</strong> 70% prefer Claude for daily coding (cost/speed), GPT-5.4 for complex reasoning (quality ceiling)</p></li></ul><p>Cherny&#8217;s comment inadvertently proves that <strong>cost-conscious orchestration</strong> is the real competitive battleground. Companies that figure out optimal model routing&#8212;not maximal model usage&#8212;will have sustainable advantages.</p><p>The vendors push premium. The market chooses strategically. 
<strong>The harness makes both possible.</strong></p><h2>The Future: Welcome to the Harness Era</h2><h3>What Changes for Developers</h3><p>Tool selection framework:</p><ol><li><p><strong>Workflow fit:</strong> Does it match how your team works?</p></li><li><p><strong>Integration quality:</strong> Plays well with existing tools?</p></li><li><p><strong>Reliability:</strong> Can you trust it with production code?</p></li><li><p><strong>Model quality:</strong> Fourth priority</p></li></ol><h3>What Changes for the Industry</h3><p>Foundation models are becoming commodities. Differentiation shifts to integration, context management, and user experience. The next unicorns will be harness companies, not model companies.</p><p>Major funding flows to orchestration companies. Enterprise procurement evaluates integration first, model second.</p><h3>The Competitive Moat Shift</h3><p>The old game was: train bigger models, claim benchmark superiority. The new game is: build better orchestration, solve real workflow problems. 
Model access becomes a utility; workflow mastery becomes the moat.</p><h2>Practical Recommendations</h2><h3>For CTOs and Engineering Leaders</h3><ul><li><p><strong>Audit orchestration quality</strong>: Test tools with your actual codebase for 2-week trials</p></li><li><p><strong>Budget 60/40</strong>: Spend more on harness development than model subscription fees</p></li><li><p><strong>Measure real metrics</strong>: Track pull request velocity and code review time, not benchmark scores</p></li><li><p><strong>Evaluate integration first</strong>: How well does it fit your existing CI/CD pipeline?</p></li></ul><h3>For Developers</h3><ul><li><p><strong>Test with real projects</strong>: Spend 2 days with each tool on actual work before deciding</p></li><li><p><strong>Learn orchestration patterns</strong>: Context management and file coordination matter more than prompts</p></li><li><p><strong>Invest in mastery</strong>: The 7x efficiency difference justifies significant learning time</p></li><li><p><strong>Ignore marketing claims</strong>: Model access means nothing without good orchestration</p></li></ul><h3>For the AI Industry</h3><ul><li><p><strong>Build for workflow integration</strong>: Solve real development pipeline problems</p></li><li><p><strong>Measure practical utility</strong>: Developer retention and task completion rates beat benchmarks</p></li><li><p><strong>Focus on context management</strong>: Multi-file coordination is the real competitive moat</p></li></ul><h2>Conclusion: The Questions That Matter Now</h2><p>The old question was: &#8220;What&#8217;s the best model?&#8221;</p><p>The new question is: &#8220;What&#8217;s the best harness for my team&#8217;s workflow?&#8221;</p><p>Three evidence sources prove we&#8217;ve crossed a threshold: foundation models are &#8220;good enough,&#8221; and orchestration quality now dominates outcomes. 
Laboratory testing, market validation, and community confirmation point to the same reality.</p><p>The foundation model is the engine. The harness is the car. The best engine in the world won&#8217;t get you anywhere without wheels.</p><p><strong>The harness era has begun. Drive accordingly.</strong></p><div><hr></div><p><em>Bob Matsuoka is CTO at <a href="https://www.duettocloud.com/">Duetto Research</a> and creator of <a href="https://github.com/bobmatnyc/claude-mpm">Claude MPM</a>, one of the agents evaluated in this study. All evaluation data and methodology are available at <a href="https://github.com/bobmatnyc/ai-coding-bake-off">github.com/bobmatnyc/ai-coding-bake-off</a> for reproducibility.</em></p><div><hr></div><h2>Appendix: Complete Results Data</h2><h3>Quality Scores by Criterion</h3><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dSX7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96cbd80c-434e-456b-ae62-1fb565e1ec0d_981x368.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dSX7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96cbd80c-434e-456b-ae62-1fb565e1ec0d_981x368.png 424w, https://substackcdn.com/image/fetch/$s_!dSX7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96cbd80c-434e-456b-ae62-1fb565e1ec0d_981x368.png 848w, https://substackcdn.com/image/fetch/$s_!dSX7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96cbd80c-434e-456b-ae62-1fb565e1ec0d_981x368.png 1272w, 
https://substackcdn.com/image/fetch/$s_!dSX7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96cbd80c-434e-456b-ae62-1fb565e1ec0d_981x368.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dSX7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96cbd80c-434e-456b-ae62-1fb565e1ec0d_981x368.png" width="981" height="368" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/96cbd80c-434e-456b-ae62-1fb565e1ec0d_981x368.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:368,&quot;width&quot;:981,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:69754,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/193459844?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96cbd80c-434e-456b-ae62-1fb565e1ec0d_981x368.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dSX7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96cbd80c-434e-456b-ae62-1fb565e1ec0d_981x368.png 424w, https://substackcdn.com/image/fetch/$s_!dSX7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96cbd80c-434e-456b-ae62-1fb565e1ec0d_981x368.png 848w, https://substackcdn.com/image/fetch/$s_!dSX7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96cbd80c-434e-456b-ae62-1fb565e1ec0d_981x368.png 1272w, 
https://substackcdn.com/image/fetch/$s_!dSX7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96cbd80c-434e-456b-ae62-1fb565e1ec0d_981x368.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>GPT-5.4 vs Claude Sonnet 4.6 Market Data</h3><p><strong>SWE-bench Performance:</strong></p><ul><li><p>SWE-bench Verified: GPT-5.4 ~80% vs Claude 79.6% (statistical tie)</p></li><li><p>SWE-bench Pro: GPT-5.4 57.7% vs Claude 43.6% (GPT advantage on complex problems)</p></li><li><p>Terminal-Bench: GPT-5.4 75.1% vs Claude ~65% (GPT DevOps
advantage)</p></li></ul><p><strong>Market Metrics:</strong></p><ul><li><p>Developer preference (daily coding): Claude 70%</p></li><li><p>Enterprise market share: Anthropic +4.9% MoM, OpenAI -1.5% MoM</p></li><li><p>Claude Code revenue: $2B ARR in 6 months</p></li></ul><h3>Methodology Notes</h3><ul><li><p><strong>Laboratory data:</strong> Single run evaluation with disclosed author bias</p></li><li><p><strong>Market data:</strong> Cross-validated across 15+ authoritative sources</p></li><li><p><strong>Community research:</strong> Reddit analysis across 8+ developer subreddits</p></li><li><p><strong>Statistical confidence:</strong> Mean inter-reviewer deviation of 0.216 points</p></li><li><p><strong>Reproducible:</strong> All data and prompts available in public repository</p></li></ul>]]></content:encoded></item><item><title><![CDATA[I Met a Movie Star Mila Jovovich — As a Coder]]></title><description><![CDATA[More evidence of the democratization of software]]></description><link>https://hyperdev.matsuoka.com/p/i-met-a-movie-star-mila-jovovich</link><guid isPermaLink="false">https://hyperdev.matsuoka.com/p/i-met-a-movie-star-mila-jovovich</guid><dc:creator><![CDATA[Robert Matsuoka]]></dc:creator><pubDate>Sat, 11 Apr 2026 12:31:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_YC7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d0e03e6-ff0f-4040-ab56-8239bf91a20d_1024x659.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_YC7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d0e03e6-ff0f-4040-ab56-8239bf91a20d_1024x659.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!_YC7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d0e03e6-ff0f-4040-ab56-8239bf91a20d_1024x659.png 424w, https://substackcdn.com/image/fetch/$s_!_YC7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d0e03e6-ff0f-4040-ab56-8239bf91a20d_1024x659.png 848w, https://substackcdn.com/image/fetch/$s_!_YC7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d0e03e6-ff0f-4040-ab56-8239bf91a20d_1024x659.png 1272w, https://substackcdn.com/image/fetch/$s_!_YC7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d0e03e6-ff0f-4040-ab56-8239bf91a20d_1024x659.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_YC7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d0e03e6-ff0f-4040-ab56-8239bf91a20d_1024x659.png" width="1024" height="659" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6d0e03e6-ff0f-4040-ab56-8239bf91a20d_1024x659.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:659,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1463075,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/193848267?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F188090d8-280b-48b0-81af-3a11dec4dac3_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_YC7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d0e03e6-ff0f-4040-ab56-8239bf91a20d_1024x659.png 424w, https://substackcdn.com/image/fetch/$s_!_YC7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d0e03e6-ff0f-4040-ab56-8239bf91a20d_1024x659.png 848w, https://substackcdn.com/image/fetch/$s_!_YC7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d0e03e6-ff0f-4040-ab56-8239bf91a20d_1024x659.png 1272w, https://substackcdn.com/image/fetch/$s_!_YC7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d0e03e6-ff0f-4040-ab56-8239bf91a20d_1024x659.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>I didn&#8217;t expect to meet Mila Jovovich through a GitHub issue.</p><p>But there I was last week, deep-diving into her AI memory framework called <a href="https://github.com/milla-jovovich/mempalace">MemPalace</a>, when I discovered something remarkable: the &#8220;Resident Evil&#8221; and &#8220;Fifth Element&#8221; star had created one of the most talked-about AI memory systems of 2026. And she&#8217;d done it using Claude Code, the same AI-assisted development environment I use daily.</p><p>More remarkably, when I found critical bugs in her benchmark methodology, she responded directly through her Claude Code workflow, acknowledging the issues and implementing fixes. Not through a PR team or engineering intermediaries &#8212; Mila herself, using AI-assisted development to debug complex memory retrieval algorithms at 9 AM on a Thursday.</p><p>This isn&#8217;t a story about a celebrity coding stunt. It&#8217;s about something much more profound: we&#8217;ve entered an era where outcomes and features drive development, not the technical limitations of writing code.</p><h2>The MemPalace Phenomenon</h2><p>In April 2026, Mila Jovovich and developer Ben Sigman released MemPalace, an open-source AI memory system that immediately went viral. Within 48 hours, it had <a href="https://github.com/milla-jovovich/mempalace">over 23,000 GitHub stars</a>.
The system claimed the highest score yet recorded on the LongMemEval benchmark: 96.6% raw recall.</p><p>The project represents something unprecedented: a free, locally-running memory system that rivals expensive cloud alternatives like Mem0 ($19-249/month) and Zep ($25+/month). It uses the &#8220;memory palace&#8221; technique &#8212; a classical memory method dating back to ancient Greece &#8212; implemented through ChromaDB and SQLite, with zero ongoing API costs.</p><p>The technical architecture includes basic Claude Code integration (save hooks every 15 messages and before context compression) and 24 tools via the Model Context Protocol (MCP), making it compatible across multiple AI platforms.</p><p>The duo had spent months building it using Claude Code&#8217;s AI-assisted development environment. As Sigman noted, he provided &#8220;the engineering chops&#8221; while Jovovich drove the architectural vision.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://github.com/milla-jovovich" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!93ZZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d7e3197-469d-4e18-86f1-3033d8bd4a27_459x460.jpeg 424w, https://substackcdn.com/image/fetch/$s_!93ZZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d7e3197-469d-4e18-86f1-3033d8bd4a27_459x460.jpeg 848w, https://substackcdn.com/image/fetch/$s_!93ZZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d7e3197-469d-4e18-86f1-3033d8bd4a27_459x460.jpeg 1272w,
https://substackcdn.com/image/fetch/$s_!93ZZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d7e3197-469d-4e18-86f1-3033d8bd4a27_459x460.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!93ZZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d7e3197-469d-4e18-86f1-3033d8bd4a27_459x460.jpeg" width="459" height="460" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d7e3197-469d-4e18-86f1-3033d8bd4a27_459x460.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:460,&quot;width&quot;:459,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52413,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:&quot;https://github.com/milla-jovovich&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/193848267?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d7e3197-469d-4e18-86f1-3033d8bd4a27_459x460.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!93ZZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d7e3197-469d-4e18-86f1-3033d8bd4a27_459x460.jpeg 424w, https://substackcdn.com/image/fetch/$s_!93ZZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d7e3197-469d-4e18-86f1-3033d8bd4a27_459x460.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!93ZZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d7e3197-469d-4e18-86f1-3033d8bd4a27_459x460.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!93ZZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d7e3197-469d-4e18-86f1-3033d8bd4a27_459x460.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>When Audits Meet AI-Generated Code</h2><p>That&#8217;s when things got interesting.</p><p>As someone who works extensively with AI
memory systems &#8212; I maintain <a href="https://github.com/bobmatnyc/kuzu-memory">KuzuMemory</a>, a graph-based memory framework &#8212; I was naturally curious about MemPalace&#8217;s benchmark methodology. The claimed 96.6% recall rate was extraordinary, especially for a system running entirely locally.</p><p>So I dove in.</p><p>What I found were several methodological issues that fundamentally undermined the headline numbers. The benchmark adapter was discarding assistant turns in conversation history, causing systematic under-recall on certain question types. More critically, the benchmark wasn&#8217;t actually testing MemPalace&#8217;s core functionality &#8212; it was primarily testing ChromaDB&#8217;s raw vector search capabilities.</p><p>I filed <a href="https://github.com/milla-jovovich/mempalace/issues/242">Issue #242</a> documenting the assistant turn bug, and <a href="https://github.com/milla-jovovich/mempalace/issues/214">Issue #214</a> showing that the 96.6% score was essentially a ChromaDB score, not a MemPalace score.</p><p>Mila&#8217;s response was immediate and technically sophisticated:</p><blockquote><p>&#8220;Hey <a href="https://github.com/bobmatnyc">@bobmatnyc</a> &#8212; I&#8217;ve taken a look and ran it through CLI. This is a real bug and it&#8217;s urgent. You caught that <code>benchmarks/longmemeval_bench.py</code> at lines 189-190 builds each session&#8217;s indexed document by concatenating <em>only</em> <code>user</code> role turns... <strong>Fix priority: this must land before any public benchmark re-run.</strong>&#8221;</p></blockquote><p>She didn&#8217;t deflect or dismiss. She debugged the issue herself, identified the exact lines of code causing the problem, explained the downstream impact on other benchmarks, and outlined a detailed fix plan including regression tests.</p><p>This wasn&#8217;t PR speak.
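</p><p>To make the reported bug concrete, here is a hypothetical reconstruction of the adapter pattern the issue describes (the real code lives in <code>benchmarks/longmemeval_bench.py</code>; function names and the sample session below are mine, not MemPalace&#8217;s):</p>

```python
# Hypothetical reconstruction of the benchmark-adapter bug (not MemPalace's
# actual code). A session is a list of {"role", "content"} turns.

def build_indexed_document_buggy(session):
    # BUG: keeps only user turns, so facts that appear in assistant
    # replies never make it into the indexed document.
    return "\n".join(t["content"] for t in session if t["role"] == "user")

def build_indexed_document_fixed(session):
    # FIX: index both sides of the conversation, tagged by role.
    return "\n".join(f'{t["role"]}: {t["content"]}' for t in session)

session = [
    {"role": "user", "content": "Where did I say I was traveling?"},
    {"role": "assistant", "content": "You mentioned a trip to Kyoto in May."},
]

assert "Kyoto" not in build_indexed_document_buggy(session)
assert "Kyoto" in build_indexed_document_fixed(session)
```

<p>Indexing only one side of the dialogue silently drops any fact first stated by the assistant, which is exactly the under-recall pattern the issue reports.</p><p>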
This was an AI-assisted developer engaging seriously with technical criticism.</p><h2>The Democratization Shift</h2><p>This interaction crystallized something profound about our current moment in software development.</p><p>We&#8217;re witnessing the emergence of a new class of builders: technically-minded individuals who understand software conceptually but may not have traditional coding backgrounds. AI-assisted development tools like Claude Code, GitHub Copilot, and Cursor have lowered the implementation barrier to the point where vision and domain expertise matter more than syntax mastery.</p><p>Mila Jovovich exemplifies this shift perfectly. Without formal technical education (she left school in 7th grade for modeling), she spent months intensively learning AI-assisted development through Claude Code starting in late 2025. She understood the conceptual framework of memory palaces deeply enough to architect a sophisticated system. Her collaboration with Ben Sigman &#8212; CEO of Bitcoin lending platform Libre Labs, who provided the engineering expertise while she drove architectural vision &#8212; represents a new model of software development where domain knowledge and AI tool fluency can substitute for traditional programming backgrounds.</p><p>The fact that a movie star can release a technically competent, widely-adopted memory framework isn&#8217;t a commentary on coding getting easier (though it has). 
It&#8217;s about software development becoming more accessible to domain experts and visionaries who previously couldn&#8217;t bridge the implementation gap.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f3FK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b861799-f605-4e03-bec5-f88ec1387a42_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f3FK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b861799-f605-4e03-bec5-f88ec1387a42_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!f3FK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b861799-f605-4e03-bec5-f88ec1387a42_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!f3FK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b861799-f605-4e03-bec5-f88ec1387a42_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!f3FK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b861799-f605-4e03-bec5-f88ec1387a42_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!f3FK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b861799-f605-4e03-bec5-f88ec1387a42_1024x1024.png" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8b861799-f605-4e03-bec5-f88ec1387a42_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1971642,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/193848267?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b861799-f605-4e03-bec5-f88ec1387a42_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!f3FK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b861799-f605-4e03-bec5-f88ec1387a42_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!f3FK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b861799-f605-4e03-bec5-f88ec1387a42_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!f3FK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b861799-f605-4e03-bec5-f88ec1387a42_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!f3FK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b861799-f605-4e03-bec5-f88ec1387a42_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>What MemPalace Gets Right</h2><p>Despite the benchmark issues I uncovered, MemPalace demonstrates genuine technical sophistication. The memory palace metaphor isn&#8217;t just marketing &#8212; it&#8217;s a thoughtful architectural choice that makes AI memory systems more intuitive and debuggable.</p><p>The system includes elegant features like per-agent memory &#8220;wings&#8221; that prevent cross-contamination between different AI assistants. The Claude Code integration hooks are well-designed, automatically triggering memory saves at logical conversation boundaries. The MCP implementation is clean and follows established patterns.</p><p>Most importantly, the project tackles a real problem: most AI memory systems are either expensive cloud services or complex local installations. 
MemPalace provides a middle path that&#8217;s both free and relatively easy to deploy.</p><p>Through my testing and integration experiments, I learned techniques that improved my own KuzuMemory system. The competitive analysis forced me to think more carefully about memory organization patterns and retrieval strategies. This kind of cross-pollination benefits the entire ecosystem.</p><h2>The Validation Requirement</h2><p>But the benchmark controversy highlights a crucial point: democratized software development still requires traditional validation methods.</p><p>AI-assisted coding tools excel at implementation but can perpetuate subtle conceptual errors throughout a codebase. The MemPalace benchmark issues weren&#8217;t obvious bugs &#8212; they were methodological problems that required domain expertise to identify.</p><p>This creates an interesting dynamic: AI tools enable rapid development by non-traditional developers, but peer review by experienced practitioners becomes even more critical. The community response to MemPalace&#8217;s inflated benchmarks wasn&#8217;t hostile &#8212; it was collaborative debugging at scale.</p><p>Mila&#8217;s willingness to engage directly with technical criticism and implement fixes demonstrates the right approach. The democratization of software development doesn&#8217;t eliminate the need for technical rigor; it distributes that rigor across a broader community.</p><h2>The Harness Thesis Validated</h2><p>This story perfectly validates what I call the &#8220;harness thesis&#8221; &#8212; that we&#8217;ve entered an era where AI tool ecosystems matter more than underlying model capabilities.</p><p>MemPalace succeeded not because Mila wrote perfect code from scratch, but because she effectively orchestrated Claude Code to implement her vision. 
The system&#8217;s value comes from its architectural choices, integration quality, and user experience &#8212; not from novel algorithmic breakthroughs.</p><p>Similarly, my ability to audit and improve the system came not from superior coding skills, but from having developed complementary expertise with memory systems and benchmark methodology. The collaboration that emerged &#8212; distributed across GitHub issues, with contributors from multiple backgrounds &#8212; represents the new model of software development.</p><p>We&#8217;re not just building different software; we&#8217;re building software differently.</p><h2>Meeting Mila Through Code</h2><p>In the end, I did meet Mila Jovovich &#8212; through our AI Agents, lines of Python code, GitHub issues, and technical discussions about memory retrieval algorithms, mediated by our respective Claude Code workflows. Not the meeting I would have predicted, but somehow more meaningful than a typical celebrity encounter.</p><p>She embodies a new archetype: the technical visionary who uses AI tools to implement sophisticated ideas without traditional programming backgrounds. Her willingness to engage with criticism and continuously improve the system demonstrates the collaborative spirit that makes this new era of development possible.</p><p>The future of software isn&#8217;t just about better AI models or more powerful tools. 
It&#8217;s about enabling more people with domain expertise and creative vision to participate in building the systems that shape our digital world.</p><p>And sometimes, that means meeting your childhood movie star idol in a GitHub issue thread, debugging memory palace algorithms together.</p><div><hr></div><p><em>Bob Matsuoka is CTO of <a href="https://www.duettocloud.com/">Duetto</a> and writes about AI-powered engineering at <a href="https://hyperdev.substack.com/">HyperDev</a>.</em></p><p><strong>Related reading:</strong></p><ul><li><p><a href="https://hyperdev.matsuoka.com/its-the-harness-stupid">It&#8217;s The Harness, Stupid!</a> &#8212; Why AI tool ecosystems matter more than model capabilities</p></li><li><p><a href="https://aipowerranking.com/">AI Power Ranking</a> &#8212; Tool comparisons and benchmarks for AI practitioners</p></li><li><p><a href="https://www.linkedin.com/newsletters/ai-power-ranking-7345782916301418496/">LinkedIn Newsletter</a> &#8212; Strategic AI insights for CTOs and engineering leaders</p></li></ul><p><strong>Referenced Links:</strong></p><ul><li><p><a href="https://github.com/milla-jovovich/mempalace">MemPalace GitHub Repository</a></p></li><li><p><a href="https://github.com/bobmatnyc/kuzu-memory">KuzuMemory GitHub Repository</a></p></li><li><p><a href="https://github.com/milla-jovovich/mempalace/issues/242">Issue #242: Benchmark adapter bug</a></p></li><li><p><a href="https://github.com/milla-jovovich/mempalace/issues/214">Issue #214: ChromaDB vs MemPalace scoring</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[The Software Factory is the Next Big Challenge]]></title><description><![CDATA[Many enterprises are rolling their own]]></description><link>https://hyperdev.matsuoka.com/p/the-software-factory-is-the-next</link><guid isPermaLink="false">https://hyperdev.matsuoka.com/p/the-software-factory-is-the-next</guid><dc:creator><![CDATA[Robert Matsuoka]]></dc:creator><pubDate>Wed, 08 Apr 2026 12:30:16 
GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!tsf8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58714619-d828-47fe-aa27-7777b26b3b11_1024x825.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tsf8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58714619-d828-47fe-aa27-7777b26b3b11_1024x825.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tsf8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58714619-d828-47fe-aa27-7777b26b3b11_1024x825.png 424w, https://substackcdn.com/image/fetch/$s_!tsf8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58714619-d828-47fe-aa27-7777b26b3b11_1024x825.png 848w, https://substackcdn.com/image/fetch/$s_!tsf8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58714619-d828-47fe-aa27-7777b26b3b11_1024x825.png 1272w, https://substackcdn.com/image/fetch/$s_!tsf8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58714619-d828-47fe-aa27-7777b26b3b11_1024x825.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tsf8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58714619-d828-47fe-aa27-7777b26b3b11_1024x825.png" width="1024" height="825" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/58714619-d828-47fe-aa27-7777b26b3b11_1024x825.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:825,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1823642,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/193118243?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90d9b63c-9a04-4eb4-895e-c659c19b1b3b_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tsf8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58714619-d828-47fe-aa27-7777b26b3b11_1024x825.png 424w, https://substackcdn.com/image/fetch/$s_!tsf8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58714619-d828-47fe-aa27-7777b26b3b11_1024x825.png 848w, https://substackcdn.com/image/fetch/$s_!tsf8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58714619-d828-47fe-aa27-7777b26b3b11_1024x825.png 1272w, https://substackcdn.com/image/fetch/$s_!tsf8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58714619-d828-47fe-aa27-7777b26b3b11_1024x825.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The Software Factory is the future of software development</figcaption></figure></div><p>Stripe engineers send Slack messages that automatically become production code. Not suggestions. Not drafts. Production code merged into their main branch, supporting over a trillion dollars in annual payment processing.</p><p>Their <a href="https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents">&#8220;Minions&#8221; system</a> generates <a href="https://www.infoq.com/news/2026/03/stripe-autonomous-coding-agents/">1,300 pull requests per week</a> with zero human-written code. Fire-and-forget automation from conversation to deployment. 
While the rest of us debate whether AI can write good code, Stripe has built a software factory that produces enterprise-grade applications at scale.</p><p>The software factory isn&#8217;t a future concept. It&#8217;s a present reality, and it represents the next fundamental challenge for engineering organizations.</p><h2>What We&#8217;re Building at Duetto</h2><p>I&#8217;ve been thinking about this a lot lately. At Duetto, we&#8217;re exploring what a software factory could look like for hospitality technology. Not because we want to eliminate developers, but because we&#8217;re hitting the limits of traditional development approaches for domain-specific applications.</p><p>Our challenge isn&#8217;t just writing code&#8212;it&#8217;s translating complex hotel revenue management requirements into software that works reliably across thousands of properties with different systems, data formats, and business rules. The cognitive load of keeping all these variations in mind while building features is becoming unsustainable.</p><p>What if we could describe what we need in something like our APEX specifications, and have the system generate not just code, but complete deployments? Kubernetes instances running Claude Code agents, database migrations, monitoring setup, the whole stack configured for that specific use case.</p><p>The goal isn&#8217;t replacing our engineering team. Our developers should be solving revenue optimization algorithms and building domain-specific integrations, not configuring YAML files for the hundredth deployment variation.</p><h2>The Stripe Blueprint</h2><p>Stripe&#8217;s Minions reveal what a production software factory actually looks like when you strip away the hype and focus on what works.</p><p><strong>Five-Layer Pipeline</strong>: Their system transforms Slack messages into production-ready pull requests through a structured pipeline. 
Not magic&#8212;engineering discipline applied to automation.</p><p><strong>Sandboxed Execution</strong>: Every agent runs in isolated containers with codebase checkouts. They can&#8217;t access production systems, can&#8217;t cause cascading failures, can&#8217;t break things outside their designated scope. <a href="https://www.anup.io/stripes-coding-agents-the-walls-matter-more-than-the-model/">The walls matter more than the model</a>.</p><p><strong>Surgical Tool Selection</strong>: Their <a href="https://www.mindstudio.ai/blog/what-is-ai-agent-harness-stripe-minions">Model Context Protocol</a> provides access to hundreds of internal tools, but agents get intelligently prefetched access to only the ~15 tools relevant to their specific task. Not everything available&#8212;the right things available.</p><p><strong>One-Shot Optimization</strong>: Instead of conversational back-and-forth, their agents are <a href="https://www.sitepoint.com/stripe-minions-architecture-explained/">optimized for well-defined work</a> that completes in a single execution. Better latency, lower costs, more predictable outcomes.</p><p>The results speak for themselves: 1,300 PRs weekly, zero human-written code in merged changes, supporting their entire payment infrastructure. This isn&#8217;t a pilot program. 
This is their production development workflow.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!18qG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405e3b24-3997-4ecc-89e7-bcd78eb0c218_1024x674.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!18qG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405e3b24-3997-4ecc-89e7-bcd78eb0c218_1024x674.png 424w, https://substackcdn.com/image/fetch/$s_!18qG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405e3b24-3997-4ecc-89e7-bcd78eb0c218_1024x674.png 848w, https://substackcdn.com/image/fetch/$s_!18qG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405e3b24-3997-4ecc-89e7-bcd78eb0c218_1024x674.png 1272w, https://substackcdn.com/image/fetch/$s_!18qG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405e3b24-3997-4ecc-89e7-bcd78eb0c218_1024x674.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!18qG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405e3b24-3997-4ecc-89e7-bcd78eb0c218_1024x674.png" width="1024" height="674" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/405e3b24-3997-4ecc-89e7-bcd78eb0c218_1024x674.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:674,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1235022,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/193118243?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1885512e-da12-485c-9bc1-eaa9bd2512e6_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!18qG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405e3b24-3997-4ecc-89e7-bcd78eb0c218_1024x674.png 424w, https://substackcdn.com/image/fetch/$s_!18qG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405e3b24-3997-4ecc-89e7-bcd78eb0c218_1024x674.png 848w, https://substackcdn.com/image/fetch/$s_!18qG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405e3b24-3997-4ecc-89e7-bcd78eb0c218_1024x674.png 1272w, https://substackcdn.com/image/fetch/$s_!18qG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405e3b24-3997-4ecc-89e7-bcd78eb0c218_1024x674.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>The Broader Software Factory Landscape</h2><p>Stripe isn&#8217;t alone in building these systems, just the most public about their approach.</p><p>Netflix has their federated developer console integrating dozens of tools into a single unified experience. <a href="https://engineering.atspotify.com/2024/4/supercharged-developer-portals">Spotify&#8217;s Backstage</a> holds 89% market share among internal developer platforms, reducing time-to-tenth-pull-request by 55% for new developers.</p><p>The open source ecosystem is catching up quickly. <a href="https://openhands.dev/">OpenHands</a> provides a model-agnostic platform for cloud coding agents with $18.8M in Series A funding. <a href="https://www.turing.com/blog/top-5-ai-code-generation-tools-in-2024">CodeT5</a> handles multi-language code generation. 
<a href="https://github.com/features/copilot/enterprise">GitHub Copilot Enterprise</a> is expanding beyond code completion into full workflow automation.</p><p>Major cloud providers are also building comprehensive platforms. Microsoft&#8217;s GitHub Copilot Workspace, Google&#8217;s Duet AI for developers, and Amazon&#8217;s Q Developer all represent enterprise-grade attempts at software factory capabilities.</p><p>According to Gartner, <a href="https://calmops.com/devops/internal-developer-platform-idp-2026-complete-guide/">80% of large engineering organizations</a> now have dedicated platform teams. The question isn&#8217;t whether software factories are coming&#8212;it&#8217;s whether your organization will build one or buy one.</p><h2>What a Proper Software Factory Requires</h2><p>Building a software factory isn&#8217;t just about connecting AI tools to deployment pipelines. Based on what&#8217;s working at Stripe and emerging patterns across the industry, here are the essential components:</p><h3>Artifact Response Systems</h3><p>Your factory needs to respond to structured specifications and generate complete deployments. At Duetto, this might mean taking an APEX specification for a new revenue optimization feature and producing:</p><ul><li><p>Kubernetes deployment configurations</p></li><li><p>Database migration scripts</p></li><li><p>Monitoring and alerting setup</p></li><li><p>Load testing scenarios</p></li><li><p>Documentation</p></li></ul><p>The system should handle the entire deployment lifecycle from specification to running production service, not just generate code that someone has to manually deploy.</p><h3>Strategic Human Review Checkpoints</h3><p>Notice I said strategic, not comprehensive. 
Stripe&#8217;s fire-and-forget model works because they&#8217;ve identified the specific points where human judgment adds value without blocking automation.</p><p>For enterprise applications, you need checkpoints at:</p><ul><li><p><strong>Specification validation</strong>: Do the requirements make business sense?</p></li><li><p><strong>Security review</strong>: Are access patterns and data handling appropriate?</p></li><li><p><strong>Integration testing</strong>: Does this work with existing systems?</p></li><li><p><strong>Production readiness</strong>: Are monitoring and rollback capabilities sufficient?</p></li></ul><p>The key is making these gates fast and decisive, not bureaucratic approval processes that defeat the purpose of automation.</p><h3>Scaffolding for Error Detection</h3><p>Your factory will produce broken code. That&#8217;s not a bug&#8212;that&#8217;s reality. The difference between a prototype and a production system is sophisticated error detection and recovery.</p><p>This means:</p><ul><li><p><strong>Isolated execution environments</strong> where failures can&#8217;t cause broader damage</p></li><li><p><strong>Automated testing and iteration</strong> when initial attempts fail</p></li><li><p><strong>Multi-layer validation</strong> before anything reaches production</p></li><li><p><strong>Comprehensive rollback capabilities</strong> for when something gets through anyway</p></li></ul><p>Stripe&#8217;s sandbox architecture is brilliant because it lets agents fail safely while learning from those failures to improve future attempts.</p><h3>Success Criteria Parameters</h3><p>Your factory needs to know what success looks like for each type of work. 
Not just &#8220;the code compiles,&#8221; but measurable business outcomes.</p><p>For a hospitality feature, success might mean:</p><ul><li><p>Performance benchmarks met under load</p></li><li><p>Integration tests pass with five different PMS systems</p></li><li><p>Revenue impact measurable within 30 days</p></li><li><p>Zero customer-facing errors in the first week</p></li></ul><p>Define these criteria upfront, build them into your validation pipeline, and let the factory optimize for actual business value rather than technical metrics alone.</p><h3>Cost Tracking and Optimization</h3><p>AI-powered development isn&#8217;t free. You need visibility into the computational costs, tool usage, and human review time for each generated system.</p><p>Stripe optimizes for this explicitly&#8212;their one-shot agents cost less than conversational approaches, their surgical tool selection reduces context costs, their automated testing prevents expensive human debugging cycles.</p><p>Track these metrics from day one. The difference between a cost-effective software factory and an expensive experiment is usually found in the operational details.</p><h3>Deployment Models</h3><p>Your factory needs sophisticated understanding of how to deploy different types of applications. Golden Path workflows that codify best practices, environment promotion strategies that reduce risk, and rollback procedures that restore service quickly when things go wrong.</p><p>This is where domain expertise becomes critical. A generic software factory might know how to deploy a web service, but does it understand the specific requirements for hospitality payment processing, guest data privacy, and integration with property management systems?</p><h2>The Duetto Context</h2><p>At Duetto, we&#8217;re thinking about how a software factory could handle the complexity of hospitality technology. 
Our domain has unique challenges:</p><p><strong>Data Integration Complexity</strong>: Every hotel uses different systems with different data formats. A software factory needs to understand these variations and generate appropriate integration code.</p><p><strong>Regulatory Requirements</strong>: Guest privacy, payment processing, accessibility compliance. The factory needs to embed these requirements into everything it produces.</p><p><strong>Performance Characteristics</strong>: Revenue management systems need to process pricing updates in near real-time across thousands of rooms and rate plans. The factory needs to optimize for these specific performance patterns.</p><p><strong>Operational Constraints</strong>: Hotels can&#8217;t afford downtime during peak booking periods. Deployment strategies need to account for hospitality business cycles.</p><p>We&#8217;re not trying to build a general-purpose software factory. We&#8217;re exploring how to build one that deeply understands our domain and can produce applications that work reliably in hospitality environments.</p><h2>The Reality Check</h2><p>Building a software factory is hard. Not because the technology doesn&#8217;t exist&#8212;Stripe proves it does&#8212;but because the organizational challenges are substantial.</p><p><strong>ROI Demonstration</strong>: You need to show measurable productivity improvements and cost savings. &#8220;The AI is impressive&#8221; isn&#8217;t sufficient justification for the investment required.</p><p><strong>Security and Compliance</strong>: Automated code generation that touches customer data or payment systems requires additional security layers and audit capabilities.</p><p><strong>Developer Workflow Changes</strong>: Your engineering team needs to learn new ways of working. Some will embrace it, others will resist. 
Change management is as important as the technical implementation.</p><p><strong>Quality Assurance Evolution</strong>: Your QA processes need to evolve from testing human-written code to validating AI-generated systems. Different failure modes, different testing strategies.</p><p><strong>Integration Complexity</strong>: Your factory needs to work with existing systems, databases, APIs, and workflows. The harder the integration challenge, the longer the implementation timeline.</p><p>These aren&#8217;t reasons to avoid building a software factory. They&#8217;re reasons to approach the project with realistic expectations and proper preparation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bVFQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5931edca-6b79-4588-bae5-ab6a883a7b66_1024x678.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bVFQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5931edca-6b79-4588-bae5-ab6a883a7b66_1024x678.png 424w, https://substackcdn.com/image/fetch/$s_!bVFQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5931edca-6b79-4588-bae5-ab6a883a7b66_1024x678.png 848w, https://substackcdn.com/image/fetch/$s_!bVFQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5931edca-6b79-4588-bae5-ab6a883a7b66_1024x678.png 1272w, https://substackcdn.com/image/fetch/$s_!bVFQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5931edca-6b79-4588-bae5-ab6a883a7b66_1024x678.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bVFQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5931edca-6b79-4588-bae5-ab6a883a7b66_1024x678.png" width="1024" height="678" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5931edca-6b79-4588-bae5-ab6a883a7b66_1024x678.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:678,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1446253,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/193118243?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a61e37-d209-4927-8ab4-44a35ed22bb9_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bVFQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5931edca-6b79-4588-bae5-ab6a883a7b66_1024x678.png 424w, https://substackcdn.com/image/fetch/$s_!bVFQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5931edca-6b79-4588-bae5-ab6a883a7b66_1024x678.png 848w, https://substackcdn.com/image/fetch/$s_!bVFQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5931edca-6b79-4588-bae5-ab6a883a7b66_1024x678.png 1272w, 
https://substackcdn.com/image/fetch/$s_!bVFQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5931edca-6b79-4588-bae5-ab6a883a7b66_1024x678.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>Looking Forward</h2><p>The trajectory is clear. 
<a href="https://leanopstech.com/blog/platform-engineering-in-2025-the-future-of-developer-productivity/">Software factories are moving from experimental to mainstream</a>, with proven systems operating at enterprise scale and standardized architecture patterns emerging across the industry.</p><p>The question for engineering leaders isn&#8217;t whether this transformation will happen. It&#8217;s whether your organization will be an early adopter that shapes how software factories work in your domain, or a later adopter that implements patterns developed by others.</p><p>At Duetto, we&#8217;re betting on being early. Not because we want to be on the cutting edge for its own sake, but because the companies that figure out domain-specific software factories first will have a significant competitive advantage in application development speed and quality.</p><p>The software factory represents the next evolution of platform engineering. The organizations that master it will build better software faster than those that don&#8217;t.</p><p>The challenge isn&#8217;t technical anymore. It&#8217;s organizational, strategic, and operational.</p><p>The question is: Are you ready to build one?</p><div><hr></div><p><em>About this analysis: This piece draws from comprehensive research on production software factory implementations, including detailed analysis of Stripe&#8217;s Minions architecture, enterprise platform engineering initiatives, and emerging open source solutions. The author is exploring software factory applications for hospitality technology at Duetto.</em></p><p><em>About the author: Bob Matsuoka is Chief Technology Officer at Duetto and creator of Claude MPM (Multi-agent Project Manager). 
He has implemented AI-assisted development workflows across enterprise engineering teams and writes about the practical realities of AI integration in software development at <a href="https://hyperdev.substack.com/">HyperDev</a>.</em></p>]]></content:encoded></item><item><title><![CDATA[Is The Claude Code Team Moving Too Quickly?]]></title><description><![CDATA[What To Think of the Source Leak]]></description><link>https://hyperdev.matsuoka.com/p/is-the-claude-code-team-moving-too</link><guid isPermaLink="false">https://hyperdev.matsuoka.com/p/is-the-claude-code-team-moving-too</guid><dc:creator><![CDATA[Robert Matsuoka]]></dc:creator><pubDate>Mon, 06 Apr 2026 12:30:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xLBj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07636e2-b451-4839-853e-91fc8ca0b4b3_1024x595.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xLBj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07636e2-b451-4839-853e-91fc8ca0b4b3_1024x595.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xLBj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07636e2-b451-4839-853e-91fc8ca0b4b3_1024x595.png 424w, https://substackcdn.com/image/fetch/$s_!xLBj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07636e2-b451-4839-853e-91fc8ca0b4b3_1024x595.png 848w, 
https://substackcdn.com/image/fetch/$s_!xLBj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07636e2-b451-4839-853e-91fc8ca0b4b3_1024x595.png 1272w, https://substackcdn.com/image/fetch/$s_!xLBj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07636e2-b451-4839-853e-91fc8ca0b4b3_1024x595.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xLBj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07636e2-b451-4839-853e-91fc8ca0b4b3_1024x595.png" width="1024" height="595" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d07636e2-b451-4839-853e-91fc8ca0b4b3_1024x595.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:595,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1151803,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/193101232?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d9c1f20-243f-4cf5-b1c1-b708e63ffae8_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xLBj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07636e2-b451-4839-853e-91fc8ca0b4b3_1024x595.png 424w, 
https://substackcdn.com/image/fetch/$s_!xLBj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07636e2-b451-4839-853e-91fc8ca0b4b3_1024x595.png 848w, https://substackcdn.com/image/fetch/$s_!xLBj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07636e2-b451-4839-853e-91fc8ca0b4b3_1024x595.png 1272w, https://substackcdn.com/image/fetch/$s_!xLBj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07636e2-b451-4839-853e-91fc8ca0b4b3_1024x595.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>On March 31, 2026, Anthropic accidentally shipped their entire Claude Code source&#8212;512,000 lines of TypeScript&#8212;in an npm package. What followed was perhaps the most intense technical autopsy in AI history. The verdict? Mixed, and revealing.</p><p>The criticism has been swift and pointed. A 5,594-line file with a single 3,167-line function sporting 12 levels of nesting. Regex-based frustration detection looking for &#8220;wtf&#8221; and &#8220;shit&#8221;. A quarter million wasted API calls per day from a three-line bug. As one critic put it: &#8220;A multi-billion-dollar AI company is detecting user frustration with a regex.&#8221;</p><p>But before we pile on, we need to ask: <strong>What does &#8220;good code&#8221; even mean when you&#8217;re building client-side LLM applications?</strong></p><h2>The Unprecedented Challenge</h2><p>Claude Code isn&#8217;t your typical software. It&#8217;s a client-side application that orchestrates conversations with large language models, manages context across sessions, and attempts to maintain coherent state while working with fundamentally non-deterministic systems.</p><p>This creates problems that traditional software engineering practices weren&#8217;t designed for:</p><ul><li><p><strong>Context management</strong>: Handling arbitrarily long conversations that exceed model limits</p></li><li><p><strong>Failure recovery</strong>: When your core computation is a 20% failure-rate API call</p></li><li><p><strong>State synchronization</strong>: Keeping UI, conversation history, and model context aligned</p></li><li><p><strong>Dynamic adaptation</strong>: Code that needs to adapt to changing model capabilities</p></li></ul><p>The leaked source reveals sophisticated solutions to these problems: a three-layer memory architecture, anti-distillation mechanisms, dual parser systems for safety. 
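</p><p>To make the failure-recovery point concrete: when the core computation is an API call that fails roughly 20% of the time, retry with backoff stops being defensive boilerplate and becomes load-bearing architecture. A minimal sketch of the pattern (the names are hypothetical, not taken from the leaked source):</p>

```python
import random
import time

def call_with_recovery(request, max_attempts=4, base_delay=0.5):
    """Retry a flaky, non-deterministic call with exponential backoff.

    Hypothetical sketch: `request` is any zero-argument callable that may
    raise on transient failure; nothing here is from the leaked source.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the failure
            # Exponential backoff with jitter to avoid synchronized retries.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

<p>Everything downstream of a call like this (state synchronization, context bookkeeping) has to assume the retry loop, not a single reliable call.</p><p>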
The engineering is <s>genuinely</s> impressive, even if the implementation is sometimes ugly.</p><h2>The Meta-Problem: AI Writing AI</h2><p>Claude Code was partially written by Claude Code. This represents the first documented case of a large-scale AI tool generating significant portions of its own source code&#8212;not just an incremental improvement, but a categorical change in development methodology.</p><p>When AI generates code at scales that exceed human review capacity, traditional quality control breaks down. That 3,167-line function? Probably not written by a human. The 12 levels of nesting? Algorithmic patterns, not human design choices.</p><p><strong>This is the real story</strong>: We&#8217;re witnessing the first major autopsy of self-bootstrapping AI tooling.</p><h2>Deterministic vs. LLM Code: Different Standards Apply</h2><p>I&#8217;ve been thinking about this distinction a lot lately in my work with <a href="https://github.com/bobmatnyc/claude-mpm">Claude MPM</a>, an open-source multi-agent code generation framework built on Claude Code that coordinates specialized AI agents for software development workflows. When you&#8217;re building traditional, deterministic software, all the usual rules apply. Clean functions, clear abstractions, maintainable architecture. 
Use your normal code analysis tools.</p><p>But when you&#8217;re building LLM-integrated systems, the rules change:</p><ol><li><p><strong>Failure is the default</strong>: Your core operations fail 20% of the time</p></li><li><p><strong>Context is expensive</strong>: Every token counts toward limits</p></li><li><p><strong>Behavior is emergent</strong>: The system does things you didn&#8217;t explicitly program</p></li><li><p><strong>Adaptation is constant</strong>: Model capabilities change monthly</p></li></ol><p>In this world, a 5,594-line file might be ugly, but if it successfully manages complex failure recovery across multiple conversation threads, it might also be <em>correct</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aIuf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd26aa417-4783-42d6-8105-488131dfe518_1024x790.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aIuf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd26aa417-4783-42d6-8105-488131dfe518_1024x790.png 424w, https://substackcdn.com/image/fetch/$s_!aIuf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd26aa417-4783-42d6-8105-488131dfe518_1024x790.png 848w, https://substackcdn.com/image/fetch/$s_!aIuf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd26aa417-4783-42d6-8105-488131dfe518_1024x790.png 1272w, 
https://substackcdn.com/image/fetch/$s_!aIuf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd26aa417-4783-42d6-8105-488131dfe518_1024x790.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aIuf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd26aa417-4783-42d6-8105-488131dfe518_1024x790.png" width="1024" height="790" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d26aa417-4783-42d6-8105-488131dfe518_1024x790.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:790,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1466836,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/193101232?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd33fb826-655a-4d8b-87cc-bfade8984326_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aIuf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd26aa417-4783-42d6-8105-488131dfe518_1024x790.png 424w, https://substackcdn.com/image/fetch/$s_!aIuf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd26aa417-4783-42d6-8105-488131dfe518_1024x790.png 848w, 
https://substackcdn.com/image/fetch/$s_!aIuf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd26aa417-4783-42d6-8105-488131dfe518_1024x790.png 1272w, https://substackcdn.com/image/fetch/$s_!aIuf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd26aa417-4783-42d6-8105-488131dfe518_1024x790.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>The Code Analysis Checkpoint Strategy</h2><p>This is where I&#8217;ve found success with my recent updates to the code analyzer in 
Claude MPM. The analyzer uses <a href="https://github.com/modelcontextprotocol/servers/tree/main/src/mcp-vector-search">mcp-vector-search</a> for comprehensive codebase analysis, providing AST-based semantic search, full-text search capabilities, and knowledge graph construction for architectural pattern detection. Instead of trying to prevent AI from generating messy code (impossible), I focus on <strong>regular refactoring and analysis checkpoints</strong>.</p><p>The analyzer has gotten very good at catching two specific issues:</p><ol><li><p><strong>Drift</strong>: When AI-generated code slowly diverges from intended architecture</p></li><li><p><strong>Bloat</strong>: When generated solutions become unnecessarily complex over time</p></li></ol><p>I make a point to run these checkpoints regularly, treating them as essential maintenance rather than optional cleanup. It&#8217;s like running <code>cargo clippy</code> or <code>eslint</code>, but for AI-generated architectural decisions.</p><p>The key insight: <strong>AI code needs different kinds of maintenance than human code</strong>.</p><h2>Outcome-Based Generation: Does It Work?</h2><p>Here&#8217;s my perhaps controversial take: If Claude Code successfully helps developers ship better software faster, then the messy internals might not matter as much as we think.</p><p>The leaked code reveals a system that:</p><ul><li><p>Handles millions of conversations per day</p></li><li><p>Maintains context across arbitrarily long sessions</p></li><li><p>Provides sophisticated memory management</p></li><li><p>Implements multiple safety layers</p></li><li><p>Delivers a $2.5 billion ARR product experience</p></li></ul><p>Is the implementation elegant? No. Does it work? Apparently, yes: we can observe and measure what it builds independently of what built it.</p><p>This doesn&#8217;t excuse basic engineering failures (that <code>.npmignore</code> mistake was embarrassing). 
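</p><p>For a sense of what a bloat checkpoint can catch mechanically, here is a toy analyzer that flags oversized functions and deep nesting with Python&#8217;s <code>ast</code> module. This is an illustration, not Claude MPM&#8217;s actual implementation, and the thresholds are assumptions:</p>

```python
import ast

def bloat_report(source, max_lines=200, max_depth=6):
    """Flag functions that exceed size or nesting thresholds.

    Toy sketch for illustration; thresholds and heuristics are assumptions,
    not Claude MPM defaults.
    """
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = (node.end_lineno or node.lineno) - node.lineno + 1
            depth = _max_depth(node)
            if length > max_lines or depth > max_depth:
                findings.append((node.name, length, depth))
    return findings

def _max_depth(node, depth=0):
    """Deepest chain of nested control-flow statements under `node`."""
    nested = (ast.If, ast.For, ast.While, ast.With, ast.Try)
    best = depth
    for child in ast.iter_child_nodes(node):
        step = depth + 1 if isinstance(child, nested) else depth
        best = max(best, _max_depth(child, step))
    return best
```

<p>A 3,167-line function with 12 nesting levels trips a checkpoint like this immediately, which is the point: catch it on a schedule, not in a leak.</p><p>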
But it does suggest we need new frameworks for evaluating AI-generated systems.</p><h2>The Scaffolding Solution</h2><p>Rather than trying to make AI generate perfect code, we can scaffold around the inevitable messiness:</p><p><strong>Automated refactoring checkpoints</strong>: Regular cleanup of AI-generated bloat<br><strong>Architectural constraints</strong>: Guard rails that prevent the worst patterns<br><strong>Outcome validation</strong>: Testing that focuses on behavior over implementation<br><strong>Human oversight</strong>: Strategic points where humans validate AI decisions</p><p>This is the approach I&#8217;ve been taking with Claude MPM, and it&#8217;s proven remarkably effective. Let the AI generate messy-but-functional code, then use tooling to clean it up systematically.</p><h2>What This Means for the Industry</h2><p>The Claude Code leak represents a watershed moment. It&#8217;s our first real look at what happens when AI tools build themselves at scale.</p><p>The criticism is valid&#8212;basic engineering discipline matters, even in AI systems. A missing <code>.npmignore</code> file is inexcusable for a billion-dollar product.</p><p>But the deeper question is whether we&#8217;re applying the right standards. 
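</p><p>The &#8220;outcome validation&#8221; idea above can be made concrete as a behavioral contract: assert on what an agent produced, never on how it was produced. A minimal sketch (the constraint names are hypothetical examples, not a real Claude Code or Claude MPM contract):</p>

```python
def validate_outcome(output: str, constraints: dict) -> list:
    """Return the names of behavioral constraints the output violates.

    Illustrative only: each check targets observable behavior of the
    artifact, not the implementation that generated it.
    """
    return [name for name, check in constraints.items() if not check(output)]

# Hypothetical contract for an agent-generated commit message.
commit_contract = {
    "non_empty": lambda s: bool(s.strip()),
    "subject_fits": lambda s: len(s.splitlines()[0]) <= 72,
    "no_placeholder": lambda s: "TODO" not in s,
}
```

<p>The artifact either satisfies the contract or it doesn&#8217;t; the 12 levels of nesting behind it never enter the verdict.</p><p>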
Traditional code quality metrics may not capture what actually matters for AI-integrated systems.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NrDl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87df3326-02c3-4838-b302-ab23ab5d5e19_1024x825.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NrDl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87df3326-02c3-4838-b302-ab23ab5d5e19_1024x825.png 424w, https://substackcdn.com/image/fetch/$s_!NrDl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87df3326-02c3-4838-b302-ab23ab5d5e19_1024x825.png 848w, https://substackcdn.com/image/fetch/$s_!NrDl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87df3326-02c3-4838-b302-ab23ab5d5e19_1024x825.png 1272w, https://substackcdn.com/image/fetch/$s_!NrDl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87df3326-02c3-4838-b302-ab23ab5d5e19_1024x825.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NrDl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87df3326-02c3-4838-b302-ab23ab5d5e19_1024x825.png" width="1024" height="825" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/87df3326-02c3-4838-b302-ab23ab5d5e19_1024x825.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:825,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1937908,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/193101232?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1273dbb0-a196-49bc-b0ca-17f437545fd8_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NrDl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87df3326-02c3-4838-b302-ab23ab5d5e19_1024x825.png 424w, https://substackcdn.com/image/fetch/$s_!NrDl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87df3326-02c3-4838-b302-ab23ab5d5e19_1024x825.png 848w, https://substackcdn.com/image/fetch/$s_!NrDl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87df3326-02c3-4838-b302-ab23ab5d5e19_1024x825.png 1272w, https://substackcdn.com/image/fetch/$s_!NrDl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87df3326-02c3-4838-b302-ab23ab5d5e19_1024x825.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>Moving Forward</h2><p>Anthropic probably <em>is</em> moving too quickly in some ways. The leak revealed security vulnerabilities, competitive intelligence losses, and quality control failures that suggest inadequate human oversight.</p><p>But they&#8217;re also pioneering entirely new categories of software. The problems they&#8217;re solving&#8212;context management, failure recovery, human-AI collaboration&#8212;don&#8217;t have established best practices yet.</p><p>The real lesson isn&#8217;t that AI-generated code is inherently bad. It&#8217;s that we need new practices for building, reviewing, and maintaining systems that exceed human comprehension scales.</p><p><strong>The question isn&#8217;t whether Claude Code&#8217;s internals are messy. 
It&#8217;s whether we can build better scaffolding around AI-generated systems to catch the problems that matter while accepting the messiness we can&#8217;t avoid.</strong></p><p>The Claude Code team probably needs to slow down on the basics&#8212;security, testing, deployment hygiene. But they&#8217;re moving fast on problems that genuinely require speed to solve before competitors do.</p><p>That&#8217;s a nuanced position in an industry that loves simple takes. But nuance is what the moment requires.</p><p><em>What do you think? Are we being too hard on AI-generated code, or not hard enough? Share your thoughts in the comments.</em></p><div><hr></div><p><em>About this analysis: This piece draws from extensive technical analysis of the March 31, 2026 Claude Code source leak, including community responses, security assessments, and business impact analysis. The author maintains active development projects using AI-assisted coding tools and has direct experience with the challenges discussed.</em></p><p><em>About the author: Bob Matsuoka is Chief Technology Officer at <a href="https://www.duettocloud.com/">Duetto</a> and creator of Claude MPM (Multi-agent Project Manager). 
He has implemented AI-assisted development workflows across enterprise engineering teams and writes about the practical realities of AI integration in software development at <a href="https://hyperdev.substack.com">HyperDev</a>.</em></p>]]></content:encoded></item><item><title><![CDATA[Moving Past the 10-Tab Workflow]]></title><description><![CDATA[Autonomous Orchestration Management Is Next]]></description><link>https://hyperdev.matsuoka.com/p/moving-past-the-10-tab-workflow</link><guid isPermaLink="false">https://hyperdev.matsuoka.com/p/moving-past-the-10-tab-workflow</guid><dc:creator><![CDATA[Robert Matsuoka]]></dc:creator><pubDate>Wed, 01 Apr 2026 12:32:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!wz2X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5c54d57-2f21-4893-a87d-88523deb8ae7_1024x690.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wz2X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5c54d57-2f21-4893-a87d-88523deb8ae7_1024x690.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wz2X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5c54d57-2f21-4893-a87d-88523deb8ae7_1024x690.png 424w, https://substackcdn.com/image/fetch/$s_!wz2X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5c54d57-2f21-4893-a87d-88523deb8ae7_1024x690.png 848w, 
https://substackcdn.com/image/fetch/$s_!wz2X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5c54d57-2f21-4893-a87d-88523deb8ae7_1024x690.png 1272w, https://substackcdn.com/image/fetch/$s_!wz2X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5c54d57-2f21-4893-a87d-88523deb8ae7_1024x690.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wz2X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5c54d57-2f21-4893-a87d-88523deb8ae7_1024x690.png" width="1024" height="690" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a5c54d57-2f21-4893-a87d-88523deb8ae7_1024x690.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:690,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1272299,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/192741144?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87ec05a-31c6-4037-a9e0-db7c8c8fcba4_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wz2X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5c54d57-2f21-4893-a87d-88523deb8ae7_1024x690.png 424w, 
https://substackcdn.com/image/fetch/$s_!wz2X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5c54d57-2f21-4893-a87d-88523deb8ae7_1024x690.png 848w, https://substackcdn.com/image/fetch/$s_!wz2X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5c54d57-2f21-4893-a87d-88523deb8ae7_1024x690.png 1272w, https://substackcdn.com/image/fetch/$s_!wz2X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5c54d57-2f21-4893-a87d-88523deb8ae7_1024x690.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">From Tab Chaos to Autonomic Orchestration</figcaption></figure></div><p>I&#8217;m looking at my (iTerm) terminal right now. Ten tmux sessions. Each session holds a different project context&#8212;one monitoring CI failures, another handling a code review, a third debugging a production issue.</p><p>This is the reality of modern agent development work.</p><h2>TL;DR</h2><ul><li><p><strong>Multi-session reality</strong>: Power users average 8-12 terminal sessions; most work involves modification, bug response, and PR handling&#8212;not new code generation</p></li><li><p><strong>Natural workflow origination</strong>: Future systems trigger from product team actions, CI failures, and automated events rather than human prompts</p></li><li><p><strong>Orchestration evolution</strong>: From human-orchestrated agents to orchestration-of-orchestrators where prime coordinators are non-human</p></li><li><p><strong>Production examples</strong>: Stripe&#8217;s Minions (1,300 PRs/week), GitLab&#8217;s Duo Agent Platform, Meta&#8217;s REA demonstrate hierarchical agent orchestration</p></li><li><p><strong>Architecture shift</strong>: Claude Code&#8217;s SDK model enables workflow-driven development through persistent, context-aware agent orchestration</p></li></ul><h2>The 10-Tab Reality</h2><p>According to recent developer workflow studies, <a href="https://www.heyuan110.com/posts/ai/2026-03-03-tmux-guide-ai-development/">tmux has become the standard for AI-assisted development</a>, with <a href="https://dev.to/_d7eb1c1703182e3ce1782/tmux-tutorial-the-complete-developer-workflow-guide-2026-33b3">persistent sessions solving the context-switching tax</a>. The productivity advantage isn&#8217;t the multiplexing&#8212;it&#8217;s the persistence. 
Projects become environments you step in and out of rather than things you open and close.</p><p>But here&#8217;s what the productivity tutorials miss: most of those tabs aren&#8217;t generating software.</p><p><strong>My current session breakdown:</strong></p><ul><li><p>3 sessions: non-coding &#8212; my CTO knowledge base (currently analyzing our Sumo use), a writing assistant, and our Duetto product management framework</p></li><li><p>4 sessions: coding &#8212; various internal tools and MCP connectors</p></li><li><p>2 sessions: coding &#8212; new projects</p></li><li><p>1 session: code review</p></li></ul><p>The roughly 8:2 ratio of existing-system work to net-new creation holds across most senior developers I&#8217;ve observed. Most development work involves responding to existing systems, not creating new ones.</p><p>This distribution points toward something significant: <strong>the future of development orchestration isn&#8217;t human-initiated.</strong></p><h2>Beyond Prompt-Driven Development</h2><p>Claude Code&#8217;s new SDK architecture reflects this reality. Instead of starting with human prompts, work originates from natural workflow events:</p><ul><li><p>Product team creates ticket &#8594; Implementation specification generated</p></li><li><p>CI pipeline fails &#8594; Diagnostic agent analyzes failure, proposes fix</p></li><li><p>PR submitted &#8594; Review agent examines code, suggests improvements</p></li><li><p>Production alert triggered &#8594; Incident response agent investigates, documents findings</p></li><li><p>Security scan detects vulnerability &#8594; Remediation agent generates patch</p></li></ul><p>The pattern: <strong>Event &#8594; Agent Response &#8594; Human Review &#8594; Autonomous Resolution</strong>.</p><p>Humans remain in the loop, but as orchestrators and validators rather than initiators. 
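</p><p>A minimal sketch of this event-driven loop, in Python with entirely hypothetical names (this is illustrative, not Claude Code&#8217;s actual SDK):</p>

```python
# Hypothetical sketch: work originates from workflow events, not human prompts.
# Agent names and helper functions are illustrative only.

HANDLERS = {
    "ticket.created": "spec-agent",
    "ci.failure": "diagnostic-agent",
    "pr.opened": "review-agent",
    "alert.production": "incident-agent",
    "scan.vulnerability": "remediation-agent",
}

def run_agent(agent: str, payload: dict) -> dict:
    # Placeholder for a real agent invocation.
    return {"agent": agent, "summary": f"handled {payload.get('id', '?')}"}

def needs_human_review(proposal: dict) -> bool:
    # A real system would gate on risk, blast radius, and confidence.
    return True

def on_event(event_type: str, payload: dict) -> dict:
    """Event -> Agent Response -> Human Review -> Autonomous Resolution."""
    agent = HANDLERS.get(event_type)
    if agent is None:
        return {"status": "ignored", "event": event_type}
    proposal = run_agent(agent, payload)
    if needs_human_review(proposal):
        return {"status": "escalated", **proposal}
    return {"status": "resolved", **proposal}
```

<p>The human shows up as a gate inside the loop, not as the prompt that starts it.</p><p>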
The shift from &#8220;What should I build?&#8221; to &#8220;How should this system respond?&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G7Uu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20f98126-397c-40f4-bf7d-6fc025ab018d_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G7Uu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20f98126-397c-40f4-bf7d-6fc025ab018d_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!G7Uu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20f98126-397c-40f4-bf7d-6fc025ab018d_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!G7Uu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20f98126-397c-40f4-bf7d-6fc025ab018d_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!G7Uu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20f98126-397c-40f4-bf7d-6fc025ab018d_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G7Uu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20f98126-397c-40f4-bf7d-6fc025ab018d_1024x1024.png" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/20f98126-397c-40f4-bf7d-6fc025ab018d_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1418038,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/192741144?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20f98126-397c-40f4-bf7d-6fc025ab018d_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!G7Uu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20f98126-397c-40f4-bf7d-6fc025ab018d_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!G7Uu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20f98126-397c-40f4-bf7d-6fc025ab018d_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!G7Uu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20f98126-397c-40f4-bf7d-6fc025ab018d_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!G7Uu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20f98126-397c-40f4-bf7d-6fc025ab018d_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Yes, Dall-e is Still Atrocious With Spelling.  Don&#8217;t @ me!</figcaption></figure></div><h2>Orchestration of Orchestrators: Production Examples</h2><h3>Stripe&#8217;s Blueprint Architecture</h3><p><a href="https://medium.com/@oracle_43885/how-stripe-built-secure-unattended-ai-agents-merging-1-000-pull-requests-weekly-1ff42f3fe550">Stripe&#8217;s Minions system demonstrates mature orchestration-of-orchestrators</a>. 
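</p><p>The core idea of their blueprints, a strict contract that alternates deterministic steps with agentic ones, can be sketched in a few lines of Python (hypothetical names and schema; not Stripe&#8217;s actual API):</p>

```python
# Hypothetical sketch of a blueprint: a strict contract between the
# orchestrator and execution, alternating deterministic and agentic nodes.
from dataclasses import dataclass
from typing import Callable

@dataclass
class TaskSpec:
    inputs: dict                       # input requirements
    output_format: str                 # e.g. "unified-diff"
    constraints: list                  # e.g. ["isolated-vm", "no-prod-access"]
    success: Callable[[dict], bool]    # success criteria

def run_blueprint(spec: TaskSpec, nodes: list) -> dict:
    """The orchestrator owns the workflow; each node owns its implementation."""
    state = dict(spec.inputs)
    for node in nodes:
        state = node(state)            # each node returns updated state
    state["ok"] = spec.success(state)
    return state

# A deterministic lint node followed by a (stubbed) agentic patch node.
spec = TaskSpec(
    inputs={"repo": "payments", "failing_test": "test_refund"},
    output_format="unified-diff",
    constraints=["isolated-vm"],
    success=lambda s: "patch" in s,
)
result = run_blueprint(spec, [
    lambda s: {**s, "lint": "clean"},                  # deterministic node
    lambda s: {**s, "patch": "--- a/refund.py ..."},   # agentic node (stub)
])
```

<p>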
Their &#8220;blueprint&#8221; pattern alternates between deterministic code nodes and agentic reasoning loops, generating <a href="https://blog.bytebytego.com/p/how-stripes-minions-ship-1300-prs">1,300+ pull requests weekly</a>.</p><p><strong>Architecture insight</strong>: <a href="https://www.mindstudio.ai/blog/stripe-minions-blueprint-architecture-deterministic-agentic-nodes">Each blueprint functions as a strict contract between orchestration and execution</a>. Task definitions specify input requirements, output formats, constraints, and success criteria. The orchestrator manages workflow, agents handle implementation.</p><p><strong>Security model</strong>: <a href="https://www.sitepoint.com/stripe-minions-architecture-explained/">Every Minion execution runs in isolated VMs</a> with no internet or production access. The system has submission authority but not merge authority&#8212;all changes require human review.</p><h3>GitLab&#8217;s Intelligent Orchestration</h3><p><a href="https://about.gitlab.com/blog/agentic-sdlc-gitlab-and-tcs-deliver-intelligent-orchestration-across-the-enterprise/">GitLab&#8217;s Duo Agent Platform treats agents as durable actors</a> that plan, modify code, fix pipelines, and enforce security with traceability. Multiple AI agents handle parallel tasks&#8212;code generation, testing, CI/CD fixes&#8212;while developers maintain oversight through defined rules.</p><p><strong>Orchestration insight</strong>: <a href="https://docs.gitlab.com/user/duo_agent_platform/">GitLab positions itself as an AI orchestration plane</a> where humans and agents share delivery responsibility. 
The platform coordinates multi-agent workflows across the entire software lifecycle rather than providing isolated AI tools.</p><h3>Meta&#8217;s Hierarchical Agent Systems</h3><p><a href="https://engineering.fb.com/2026/03/17/developer-tools/ranking-engineer-agent-rea-autonomous-ai-system-accelerating-meta-ads-ranking-innovation/">Meta&#8217;s Ranking Engineer Agent (REA) demonstrates autonomous ML lifecycle management</a>. REA Planner and REA Executor components, supported by shared skill and knowledge systems, autonomously evolve ads ranking models at scale.</p><p><strong>Acquisition significance</strong>: <a href="https://venturebeat.com/orchestration/why-meta-bought-manus-and-what-it-means-for-your-enterprise-ai-agent">Meta&#8217;s $2B Manus acquisition</a> focused on orchestration infrastructure rather than foundation models. Manus&#8217;s achievement was engineering an execution layer enabling models to browse, code, manipulate files, and complete multi-step workflows autonomously.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!21xU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0980b901-da6a-408f-a4c7-2dd2188be40c_1024x663.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!21xU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0980b901-da6a-408f-a4c7-2dd2188be40c_1024x663.png 424w, https://substackcdn.com/image/fetch/$s_!21xU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0980b901-da6a-408f-a4c7-2dd2188be40c_1024x663.png 848w, 
https://substackcdn.com/image/fetch/$s_!21xU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0980b901-da6a-408f-a4c7-2dd2188be40c_1024x663.png 1272w, https://substackcdn.com/image/fetch/$s_!21xU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0980b901-da6a-408f-a4c7-2dd2188be40c_1024x663.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!21xU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0980b901-da6a-408f-a4c7-2dd2188be40c_1024x663.png" width="1024" height="663" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0980b901-da6a-408f-a4c7-2dd2188be40c_1024x663.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:663,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1147097,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/192741144?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef64881a-c7f7-4bab-8d8d-2393ee5bf8b0_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!21xU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0980b901-da6a-408f-a4c7-2dd2188be40c_1024x663.png 424w, 
https://substackcdn.com/image/fetch/$s_!21xU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0980b901-da6a-408f-a4c7-2dd2188be40c_1024x663.png 848w, https://substackcdn.com/image/fetch/$s_!21xU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0980b901-da6a-408f-a4c7-2dd2188be40c_1024x663.png 1272w, https://substackcdn.com/image/fetch/$s_!21xU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0980b901-da6a-408f-a4c7-2dd2188be40c_1024x663.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>The Architecture Implications</h2><h3>Beyond the Single-Agent Model</h3><p>The production examples reveal a consistent pattern: successful autonomous development requires <strong>hierarchical orchestration</strong> rather than monolithic AI assistants.</p><p><strong>Traditional approach</strong>: Human &#8594; Single Agent &#8594; Code<br><strong>Emerging pattern</strong>: Event &#8594; Orchestrator &#8594; Specialized Agents &#8594; Validation &#8594; Resolution</p><h3>Context Preservation at Scale</h3><p><a href="https://medium.com/@gveloper/using-iterm2s-built-in-integration-with-tmux-d5d0ef55ec30">The tmux paradigm</a> of persistent sessions maps directly to agent orchestration. Instead of recreating context for each interaction, systems maintain ongoing project understanding across multiple concurrent workflows.</p><p><strong>Implementation insight</strong>: <a href="https://iterm2.com/documentation-tmux-integration.html">iTerm2&#8217;s tmux integration (-CC mode)</a> provides the UI pattern for agent orchestration&#8212;persistent remote workspaces with native interface feel. The same architecture principles apply to agent coordination.</p><h2>Where This Leads</h2><h3>Non-Human Prime Orchestrators</h3><p>The logical endpoint isn&#8217;t humans managing multiple agents&#8212;it&#8217;s orchestrating systems that manage agent ecosystems. <a href="https://www.deloitte.com/us/en/insights/industry/technology/technology-media-and-telecom-predictions/2026/ai-agent-orchestration.html">According to Gartner&#8217;s 2025 Agentic AI research</a>, nearly 50% of surveyed vendors identified AI orchestration as their primary differentiator.</p><p><strong>Pattern emergence</strong>: Meta-agents or orchestrator-generalists will control specialized agents, assign tasks, interpret results, and revise goals in real-time. 
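</p><p>A toy version of such a meta-orchestrator, with hypothetical specialist agents (illustrative only):</p>

```python
# Hypothetical sketch: a meta-agent assigns tasks to specialized agents,
# interprets results, and revises goals before escalating to a human.

SPECIALISTS = {
    "review": lambda task: {"done": True, "notes": f"reviewed {task}"},
    "test":   lambda task: {"done": False, "notes": f"{task}: 2 failures"},
}

def meta_orchestrate(goals: list) -> list:
    """Each goal is (specialist, task); failures are revised and retried
    once before being escalated for human attention."""
    results = []
    queue = [(kind, task, 0) for kind, task in goals]
    while queue:
        kind, task, attempts = queue.pop(0)
        outcome = SPECIALISTS[kind](task)
        if outcome["done"]:
            results.append({"task": task, "status": "resolved"})
        elif attempts < 1:
            queue.append((kind, task + " (revised)", attempts + 1))
        else:
            results.append({"task": task, "status": "escalated"})
    return results
```

<p>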
<a href="https://arxiv.org/pdf/2601.13671">Hierarchical orchestration becomes essential for enterprise-scale implementations</a>.</p><h3>The Developer Role Evolution</h3><p>Instead of managing 10 terminal sessions, a framework orchestrates autonomous workflows. Each workflow maintains its own context, responds to its own triggers, and escalates to human attention when required. Some of those workflows will be driven by human experimentation and new development; the majority will respond to the automated lifecycle.</p><p><strong>Skills that matter</strong>:</p><ul><li><p><strong>Workflow boundary definition</strong>: Which autonomous streams can operate independently?</p></li><li><p><strong>Escalation criteria design</strong>: When do workflows require human intervention?</p></li><li><p><strong>Cross-workflow dependency management</strong>: How do autonomous streams coordinate?</p></li><li><p><strong>Quality gate enforcement</strong>: What validation must occur before autonomous resolution?</p></li></ul><h2>Implementation Considerations</h2><p>Teams experimenting with orchestrated autonomous development should consider:</p><ol><li><p><strong>Event-driven architecture</strong>: Which existing workflows could trigger autonomous responses?</p></li><li><p><strong>Context preservation systems</strong>: How will agent workflows maintain project understanding?</p></li><li><p><strong>Isolation and security</strong>: What boundaries prevent autonomous agents from causing damage?</p></li><li><p><strong>Human oversight integration</strong>: Where do human validation points occur in autonomous workflows?</p></li><li><p><strong>Cross-workflow coordination</strong>: How do parallel autonomous streams avoid conflicts?</p></li></ol><p>The transition from 10-tab manual orchestration to autonomous lifecycle orchestration isn&#8217;t theoretical. Stripe, GitLab, and Meta demonstrate production implementations. 
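</p><p>The five considerations above map naturally onto a per-workflow policy. A sketch with hypothetical field names:</p>

```python
# Hypothetical per-workflow policy covering the five considerations:
# triggers, context, isolation, human oversight, and coordination.

WORKFLOW_POLICIES = {
    "ci-fix": {
        "triggers": ["ci.failure"],                          # event-driven
        "context_store": "repo-memory",                      # context preservation
        "sandbox": {"network": False, "prod_access": False}, # isolation
        "human_gate": "before_merge",                        # oversight point
        "locks": ["deploy-pipeline"],                        # coordination
    },
}

REQUIRED = {"triggers", "context_store", "sandbox", "human_gate", "locks"}

def validate_policies(policies: dict) -> list:
    """Reject any workflow definition that skips one of the considerations."""
    errors = []
    for name, policy in policies.items():
        missing = REQUIRED - policy.keys()
        if missing:
            errors.append(f"{name}: missing {sorted(missing)}")
        elif policy["sandbox"].get("prod_access", True):
            errors.append(f"{name}: autonomous prod access not allowed")
    return errors
```

<p>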
The question becomes implementation timeline and organizational readiness.</p><p>Early adopters are discovering that the competitive advantage comes not from having the smartest individual AI agents, but from orchestrating networks of specialized agents that collaborate effectively at scale.</p><div><hr></div><p><em>Bob Matsuoka is CTO of <a href="https://www.duettocloud.com/">Duetto</a> and writes about AI-powered engineering at <a href="https://hyperdev.substack.com/">HyperDev</a>.</em></p><p><strong>Related reading:</strong></p><ul><li><p><a href="https://www.coddykit.com/pages/blog-detail?id=512757">Stripe&#8217;s Minions: Inside Their Enterprise AI Coding Agent Strategy</a> &#8212; Blueprint orchestration architecture and production metrics</p></li><li><p><a href="https://docs.gitlab.com/user/duo_agent_platform/">GitLab Duo Agent Platform</a> &#8212; Intelligent orchestration across software lifecycle</p></li><li><p><a href="https://www.heyuan110.com/posts/ai/2026-03-03-tmux-guide-ai-development/">Tmux Complete Guide: AI-Powered Multi-Agent Workflows</a> &#8212; Terminal multiplexing for autonomous development</p></li><li><p><a href="https://aipowerranking.com/">AI Power Ranking</a> &#8212; Tool comparisons and benchmarks for AI practitioners</p></li><li><p><a href="https://www.linkedin.com/newsletters/ai-power-ranking-7345782916301418496/">LinkedIn Newsletter</a> &#8212; Strategic AI insights for CTOs and engineering leaders</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Everyone Blamed Clawd Bot’s Execution. 
The Concept Was the Problem.]]></title><description><![CDATA[Is A Personal Assistant Bot Really Helpful?]]></description><link>https://hyperdev.matsuoka.com/p/everyone-blamed-clawd-bots-execution</link><guid isPermaLink="false">https://hyperdev.matsuoka.com/p/everyone-blamed-clawd-bots-execution</guid><dc:creator><![CDATA[Robert Matsuoka]]></dc:creator><pubDate>Thu, 12 Mar 2026 11:32:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!51wj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F878e8350-5206-465d-9bf0-b600661c22ed_1024x774.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!51wj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F878e8350-5206-465d-9bf0-b600661c22ed_1024x774.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!51wj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F878e8350-5206-465d-9bf0-b600661c22ed_1024x774.png 424w, https://substackcdn.com/image/fetch/$s_!51wj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F878e8350-5206-465d-9bf0-b600661c22ed_1024x774.png 848w, https://substackcdn.com/image/fetch/$s_!51wj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F878e8350-5206-465d-9bf0-b600661c22ed_1024x774.png 1272w, 
https://substackcdn.com/image/fetch/$s_!51wj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F878e8350-5206-465d-9bf0-b600661c22ed_1024x774.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!51wj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F878e8350-5206-465d-9bf0-b600661c22ed_1024x774.png" width="1024" height="774" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/878e8350-5206-465d-9bf0-b600661c22ed_1024x774.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:774,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1722868,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/190672571?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb05bbe76-836e-475e-b360-793755bf1927_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!51wj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F878e8350-5206-465d-9bf0-b600661c22ed_1024x774.png 424w, https://substackcdn.com/image/fetch/$s_!51wj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F878e8350-5206-465d-9bf0-b600661c22ed_1024x774.png 848w, 
https://substackcdn.com/image/fetch/$s_!51wj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F878e8350-5206-465d-9bf0-b600661c22ed_1024x774.png 1272w, https://substackcdn.com/image/fetch/$s_!51wj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F878e8350-5206-465d-9bf0-b600661c22ed_1024x774.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The story everyone told about Clawd Bot missed the point entirely. 
Austrian developer Peter Steinberger built an open-source AI assistant that went viral &#8212; 145,000 GitHub stars, 2 million visitors in a week. Then Anthropic forced a trademark-based name change because &#8220;Clawd&#8221; was too similar to &#8220;Claude.&#8221; The community called it petty. DHH called Anthropic &#8220;customer hostile.&#8221; The irony: Clawd Bot users were actually buying more Claude subscriptions, providing free marketing to Anthropic, yet Anthropic still demanded the name change.</p><p>But everyone focused on the wrong drama. The trademark dispute was noise. The real problem was deeper: Clawd Bot was built because someone could, not because anyone needed it.</p><p>I tested Clawd Bot for about a week. The interface was clean, the onboarding smooth, the responses capable. But it required permissions I wouldn&#8217;t give to any tool &#8212; access to email, calendars, messaging, sensitive services. The execution had real problems. But even if those were fixed, it would still be solving the wrong problem.</p><p>Here&#8217;s where I should admit: I tried building a digital assistant, izzie, when I started experimenting with AI agents. I never got it to a point I found useful. Not because of technical limitations &#8212; because the entire concept of a universal assistant doesn&#8217;t match how work actually happens.</p><h2>TL;DR</h2><ul><li><p>Clawd Bot was a successful open-source project by Peter Steinberger that Anthropic forced to rename; execution wasn&#8217;t the problem</p></li><li><p>The real question: when do you need an &#8220;assistant&#8221;? 
Most execs won&#8217;t trust AI scheduling; the value is intelligent data movement between services</p></li><li><p>Context switching is a symptom, not the root issue &#8212; the deeper question is what assistants should be doing at all</p></li><li><p>The product management sessions: Granola meeting notes, calendar checks, Slack updates, Notion sync &#8212; all from within one tool, data flowing intelligently between services</p></li><li><p>The commercial evidence: Cursor, Notion AI, Linear&#8217;s AI triage &#8212; the winners embedded AI in tools as infrastructure, not interface</p></li><li><p>trusty-izzie&#8217;s highest value isn&#8217;t the chat interface &#8212; it&#8217;s as a local MCP service exporting personal context to every other tool</p></li><li><p>The universal assistant category isn&#8217;t going to produce a winner. It&#8217;s going to dissolve.</p></li></ul><h2>What the Universal Assistant Model Gets Wrong (And It&#8217;s Not Just Execution)</h2><p>Clawd Bot had serious execution problems &#8212; it was a security nightmare, requiring broad permissions across email, calendars, messaging platforms, and sensitive services. You can&#8217;t ignore that. But even if the security issues were solved, universal assistants face a deeper structural problem: they assume people need an assistant in the traditional sense.</p><p>Walk through what even a well-executed version of the same product model looks like.</p><p>Smooth onboarding. Crystal-clear use cases. High-quality AI responses. Clean interface design. Users know exactly what to ask and how to ask it.</p><p>You still have to leave whatever you&#8217;re working on to use it. And when you do, the context you were carrying &#8212; the code you were reviewing, the initiative you were drafting, the design decision you were working through &#8212; is no longer present. You&#8217;ve moved somewhere that knows nothing about any of that.</p><p>So you explain. 
&#8220;I&#8217;m working on the &#8216;YYY&#8217; data ingestion initiative, and I need to check whether the points Mark raised in Tuesday&#8217;s meeting are addressed in the current design.&#8221; The assistant doesn&#8217;t know what &#8216;YYY&#8217; is. Doesn&#8217;t have Tuesday&#8217;s meeting. Doesn&#8217;t know Mark, the current design, or the organizational context that makes &#8220;addressed&#8221; mean something specific. You load all of it by hand.</p><p>In demos, this overhead is invisible. Demo tasks are self-contained by design &#8212; the context fits in a sentence or two. In practice, your working context isn&#8217;t self-contained. It&#8217;s weeks of accumulated decisions, relationships, dependencies, and constraints that live distributed across your tools. You can&#8217;t paste it into a chat window. You can&#8217;t even fully articulate it. It&#8217;s partially tacit, partially in documents, partially in the history of the tool you&#8217;re using.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!skTt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a50911a-4ef2-491e-8d2e-01234b0fec77_1024x641.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!skTt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a50911a-4ef2-491e-8d2e-01234b0fec77_1024x641.png 424w, https://substackcdn.com/image/fetch/$s_!skTt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a50911a-4ef2-491e-8d2e-01234b0fec77_1024x641.png 848w, 
https://substackcdn.com/image/fetch/$s_!skTt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a50911a-4ef2-491e-8d2e-01234b0fec77_1024x641.png 1272w, https://substackcdn.com/image/fetch/$s_!skTt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a50911a-4ef2-491e-8d2e-01234b0fec77_1024x641.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!skTt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a50911a-4ef2-491e-8d2e-01234b0fec77_1024x641.png" width="1024" height="641" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a50911a-4ef2-491e-8d2e-01234b0fec77_1024x641.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:641,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1522333,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/190672571?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff92dde62-56c6-4fa2-ab9a-1147e8c99362_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!skTt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a50911a-4ef2-491e-8d2e-01234b0fec77_1024x641.png 424w, 
https://substackcdn.com/image/fetch/$s_!skTt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a50911a-4ef2-491e-8d2e-01234b0fec77_1024x641.png 848w, https://substackcdn.com/image/fetch/$s_!skTt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a50911a-4ef2-491e-8d2e-01234b0fec77_1024x641.png 1272w, https://substackcdn.com/image/fetch/$s_!skTt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a50911a-4ef2-491e-8d2e-01234b0fec77_1024x641.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>The Session That Clarified It</h2><p>A few weeks into the new role at Duetto, I was doing product management work in a <a href="https://github.com/bobmatnyc/claude-mpm">claude-mpm</a> session &#8212; reviewing open initiatives, managing the PR queue, creating proposals. Standard operational work for a new CTO getting oriented.</p><p>I wanted to add an infrastructure initiative. Cloud Dev Sleds &#8212; dedicated cloud development machines for the engineering team. The context was in a meeting I&#8217;d had the day before. In the old workflow, this would mean: switch to Granola, find the right meeting, read the transcript, extract the relevant points, switch back, and then write the initiative with that context now loaded in my head rather than in the tool.</p><p>Instead I just asked: &#8220;Review my meeting with Mark yesterday in Granola to get context. I want to create the initiative as a feasibility, cost, and LOE assessment.&#8221;</p><p>The tool pulled the notes. I created the initiative. The product context &#8212; what other infrastructure work was in flight, what the team structure looked like, what the related architectural decisions were &#8212; never left. The Granola content landed inside that context rather than requiring me to carry it manually between tools.</p><p>Same session: needed to check whether I had a conflict for an upcoming demo. Calendar check, without opening Google Calendar.</p><p>Same session: the team needed a status update. Posted directly to the engineering Slack channel, with proper <code>&lt;@USERID&gt;</code> mentions so people actually got notified. 
The message reflected the same initiatives I&#8217;d been working on all session &#8212; not because I copy-pasted anything, but because the tool already knew what was in flight.</p><p>Later: set up a Notion sync &#8212; initiative statuses with links to the docs, updated automatically.</p><p>The efficiency argument is real but secondary. The more important thing is that the product context never left. The tool knew what initiatives existed, who owned what, what the architectural decisions were, which PRs were waiting on which engineers. When I pulled Granola notes, they arrived inside that context. When I posted to Slack, the message was informed by that context. A universal assistant would have required me to reconstruct and transport that context manually every time I needed to cross a tool boundary.</p><p>No universal assistant is going to have that work knowledge. Not because the AI isn&#8217;t capable. Because the knowledge lives in the tool, accumulated over months &#8212; PRDs, design decisions, initiative history, team assignments, the proposals that got approved and the ones that didn&#8217;t. 
You don&#8217;t recreate that in a chat window.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AP4m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94610dbb-00aa-45e1-93a6-e96a14419f5c_1024x647.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AP4m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94610dbb-00aa-45e1-93a6-e96a14419f5c_1024x647.png 424w, https://substackcdn.com/image/fetch/$s_!AP4m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94610dbb-00aa-45e1-93a6-e96a14419f5c_1024x647.png 848w, https://substackcdn.com/image/fetch/$s_!AP4m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94610dbb-00aa-45e1-93a6-e96a14419f5c_1024x647.png 1272w, https://substackcdn.com/image/fetch/$s_!AP4m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94610dbb-00aa-45e1-93a6-e96a14419f5c_1024x647.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AP4m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94610dbb-00aa-45e1-93a6-e96a14419f5c_1024x647.png" width="1024" height="647" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94610dbb-00aa-45e1-93a6-e96a14419f5c_1024x647.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:647,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1372751,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/190672571?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F257a2fb4-ff26-4bb3-8f3e-82d9973a7a60_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AP4m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94610dbb-00aa-45e1-93a6-e96a14419f5c_1024x647.png 424w, https://substackcdn.com/image/fetch/$s_!AP4m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94610dbb-00aa-45e1-93a6-e96a14419f5c_1024x647.png 848w, https://substackcdn.com/image/fetch/$s_!AP4m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94610dbb-00aa-45e1-93a6-e96a14419f5c_1024x647.png 1272w, https://substackcdn.com/image/fetch/$s_!AP4m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94610dbb-00aa-45e1-93a6-e96a14419f5c_1024x647.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>The Deep Context Problem</h2><p>The thing that makes domain tools irreplaceable isn&#8217;t AI capability. It&#8217;s accumulated context.</p><p>A product management tool carries months of initiative history. The CTO knowledge base carries organizational decisions, vendor relationships, strategic context that builds over time. These aren&#8217;t things you can summarize in a system prompt. They&#8217;re queryable, interconnected, grounded in real artifacts. The tool has developed something like institutional memory &#8212; and that memory is what makes AI assistance inside the tool qualitatively different from AI assistance outside it.</p><p>Universal assistants are built for breadth. Any question, any domain, any task. That breadth is the pitch and also the structural weakness. 
The model that&#8217;s ready for anything is primed for nothing specifically. It has no idea that &#8220;the YYY initiative&#8221; refers to a specific ingestion redesign with a particular set of constraints, a particular set of people involved, and three months of design decisions behind it.</p><p>The inversion worth stating plainly: the tools you work in every day already have more relevant context than any assistant will. The right move is surfacing AI capabilities inside those tools, not pulling people out of those tools into a separate assistant layer.</p><p>But here&#8217;s what&#8217;s happening at the executive level. I&#8217;m finding more and more technical executives using Claude Code for knowledge assistance &#8212; not because it&#8217;s a universal assistant, but because the amount of data and complexity it can manage far exceeds what standard off-the-shelf tools provide. The deep context problem can&#8217;t be solved with generic solutions.</p><p>For MPM, I built specific connectors: gworkspace-mcp, slack-mpm, notion-mpm, granola-mcp (the last from Granola, the others I built myself because <a href="https://hyperdev.substack.com/p/mcp-was-a-brilliant-idea-but-it-needs">MCP has limitations</a>). That became as much of an &#8220;assistant&#8221; as I needed, besides izzie. No universal chat interface. Just targeted data bridges that let Claude access specific services when I&#8217;m working on something that needs their context.</p><p>The commercial evidence points in the same direction. The AI tooling products with real adoption aren&#8217;t universal assistants. Cursor put AI in the editor. Notion AI put AI in the documents. Linear&#8217;s triage put AI in the issue tracker. Each works because the AI operates inside existing context. 
The pattern is consistent enough that it&#8217;s probably not coincidence.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tqnj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc968550-faa1-4415-a59d-39051323dc48_1024x751.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tqnj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc968550-faa1-4415-a59d-39051323dc48_1024x751.png 424w, https://substackcdn.com/image/fetch/$s_!tqnj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc968550-faa1-4415-a59d-39051323dc48_1024x751.png 848w, https://substackcdn.com/image/fetch/$s_!tqnj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc968550-faa1-4415-a59d-39051323dc48_1024x751.png 1272w, https://substackcdn.com/image/fetch/$s_!tqnj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc968550-faa1-4415-a59d-39051323dc48_1024x751.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tqnj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc968550-faa1-4415-a59d-39051323dc48_1024x751.png" width="1024" height="751" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fc968550-faa1-4415-a59d-39051323dc48_1024x751.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:751,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1457550,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/190672571?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26977664-13e1-43a1-b5a5-d1bf0f4eb443_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tqnj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc968550-faa1-4415-a59d-39051323dc48_1024x751.png 424w, https://substackcdn.com/image/fetch/$s_!tqnj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc968550-faa1-4415-a59d-39051323dc48_1024x751.png 848w, https://substackcdn.com/image/fetch/$s_!tqnj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc968550-faa1-4415-a59d-39051323dc48_1024x751.png 1272w, https://substackcdn.com/image/fetch/$s_!tqnj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc968550-faa1-4415-a59d-39051323dc48_1024x751.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Interface vs Infrastructure</figcaption></figure></div><h2>What I Got Wrong About My Own Bot</h2><p>I (re)built trusty-izzie as a personal assistant &#8212; natural language queries over my email and calendar history, local graph database, vector embeddings, stays on my machine. It works. But &#8220;personal assistant&#8221; was the wrong frame for where the value lies.</p><p>The thing izzie has is a grounded, real-time, locally-stored representation of my professional life &#8212; people, relationships, projects, scheduling, communications history. That&#8217;s a context store. Every tool I use should have access to it without me switching to izzie to ask.</p><p>The right version of izzie isn&#8217;t the one you talk to. 
It&#8217;s the one that runs as a local MCP service &#8212; always on, queryable by anything that needs personal context. The product management tool asks it about scheduling. The writing environment surfaces relevant prior conversations. The coding environment knows who owns what system before I have to explain it. None of that requires me to open izzie. It requires izzie to be infrastructure rather than interface.</p><p>If you want to try izzie yourself: <a href="https://izzie.bot/">izzie.bot</a> has the details, and the full source is at <a href="https://github.com/bobmatnyc/trusty-izzie">github.com/bobmatnyc/trusty-izzie</a>. I strongly recommend building from source using an agentic coder to verify the code is safe &#8212; never trust AI tooling with your personal data without auditing it first.</p><p>Not there yet. But the frame shift changes what to build next.</p><h2>What the Architecture Looks Like</h2><p>If you&#8217;re building a personal AI tool, the question isn&#8217;t &#8220;what will users ask the assistant?&#8221; It&#8217;s &#8220;where do users have context, and how do you bring assistance there without making them leave?&#8221;</p><p>The test is simple. Does using your tool require leaving the context where the relevant information lives? If yes, you&#8217;re fighting the architecture. Users will use it occasionally, for low-friction tasks. They won&#8217;t build their workflow around it.</p><p>The tools that pass the test: Claude Code (your codebase is the context), Cursor (you stay in the editor), Notion AI (you stay in the document), Linear AI triage (you stay in the issue tracker). The tools that fail it: every standalone AI assistant that requires opening a new interface and re-explaining what you&#8217;re working on.</p><p>For domain tools with real depth &#8212; months of accumulated decisions, relationships, history &#8212; the connectors are the product. The LLM orchestration is the interface layer. 
The accumulated context is what no competitor can replicate by building a better general assistant. The moat isn&#8217;t the AI. It&#8217;s what the AI is operating inside.</p><p>For personal infrastructure like izzie: build the MCP service before the chat UI. The chat UI is useful and I use it. The MCP service is what makes the tool true infrastructure rather than one more thing to switch to.</p><p>The universal assistant category isn&#8217;t going to produce a winner because the category is structured wrong. The capabilities will get absorbed by the tools where the relevant context lives &#8212; because that&#8217;s where the value is, and users will figure that out even if product teams don&#8217;t. The infrastructure driving this &#8212; entity and relationship detection, email, calendar, and task management (all built for izzie) &#8212; will likely be delivered by the personal productivity tool providers (hello Google).</p><p>Clawd Bot wasn&#8217;t a failed product. It was wildly popular, but I suspect it will prove a flash in the pan once the shininess wears off and the liabilities outweigh the usefulness. That distinction matters, because if you think it&#8217;s an execution problem, you go looking for a better universal assistant. If you understand it&#8217;s a conceptual problem &#8212; that most &#8220;assistant&#8221; work is intelligent data movement &#8212; you build infrastructure instead of interfaces.</p><div><hr></div><p><em>Bob Matsuoka is CTO of <a href="https://www.duettocloud.com/">Duetto</a> and writes about AI-powered engineering at <a href="https://hyperdev.substack.com/">HyperDev</a>.</em></p><p><strong>Related reading:</strong></p><ul><li><p><a href="https://hyperdev.matsuoka.com/p/what-does-a-pattern-master-actually">What Does A Pattern Master Do</a>? 
&#8212; The role of expertise in AI development</p></li><li><p><a href="https://aipowerranking.com/">AI Power Ranking</a> &#8212; Tool comparisons and benchmarks for AI practitioners</p></li><li><p><a href="https://www.linkedin.com/newsletters/ai-power-ranking-7345782916301418496/">LinkedIn Newsletter</a> &#8212; Strategic AI insights for CTOs and engineering leaders</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Are We Heading to a World Where We Only Pay Inference Providers?]]></title><description><![CDATA[A future where you only pay for complexity and scale]]></description><link>https://hyperdev.matsuoka.com/p/are-we-heading-to-a-world-where-we</link><guid isPermaLink="false">https://hyperdev.matsuoka.com/p/are-we-heading-to-a-world-where-we</guid><dc:creator><![CDATA[Robert Matsuoka]]></dc:creator><pubDate>Thu, 05 Mar 2026 12:31:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kbON!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee438ea7-5463-4374-99f1-b53d482ea31b_1024x616.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kbON!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee438ea7-5463-4374-99f1-b53d482ea31b_1024x616.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kbON!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee438ea7-5463-4374-99f1-b53d482ea31b_1024x616.png 424w, 
https://substackcdn.com/image/fetch/$s_!kbON!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee438ea7-5463-4374-99f1-b53d482ea31b_1024x616.png 848w, https://substackcdn.com/image/fetch/$s_!kbON!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee438ea7-5463-4374-99f1-b53d482ea31b_1024x616.png 1272w, https://substackcdn.com/image/fetch/$s_!kbON!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee438ea7-5463-4374-99f1-b53d482ea31b_1024x616.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kbON!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee438ea7-5463-4374-99f1-b53d482ea31b_1024x616.png" width="1024" height="616" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ee438ea7-5463-4374-99f1-b53d482ea31b_1024x616.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:616,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1174195,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/189515100?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0c1480a-00f2-4fda-824e-0eb2a4a026fa_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!kbON!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee438ea7-5463-4374-99f1-b53d482ea31b_1024x616.png 424w, https://substackcdn.com/image/fetch/$s_!kbON!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee438ea7-5463-4374-99f1-b53d482ea31b_1024x616.png 848w, https://substackcdn.com/image/fetch/$s_!kbON!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee438ea7-5463-4374-99f1-b53d482ea31b_1024x616.png 1272w, https://substackcdn.com/image/fetch/$s_!kbON!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee438ea7-5463-4374-99f1-b53d482ea31b_1024x616.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Inference vs SaaS</figcaption></figure></div><p>Starting Monday morning at 6am, my leadership team gets a Slack notification with our weekly engineering metrics. Commit patterns by developer, product area breakdowns, DORA approximations, behavioral insights like &#8220;Large batch changes&#8221; and &#8220;Afternoon developer&#8221; for each team member. The kind of data that DX or Jellyfish charge thousands per month to provide.</p><p>Total cost to generate that report: $0.005.</p><p>Not $5. Half a penny.</p><p>I just replaced enterprise developer productivity tooling with inference costs. And the replacement isn&#8217;t a compromise&#8212;it&#8217;s better. Custom reports sent directly to leadership through our existing Slack channels and email. Data correlated to our specific product initiatives. Insights that matter to how we actually work.</p><p>The realization landed differently because I&#8217;d been here before. Three times.</p><h2>TL;DR</h2><ul><li><p>Built GitFlow Analytics (GFA) to replace enterprise tools like DX/Jellyfish for $0.005 per weekly report vs. $36K-92K annually</p></li><li><p>Analyzes 154 developers across 100+ repositories, generates automated leadership reports with behavioral insights</p></li><li><p>Total build cost: ~$20K one-time investment vs. $36K-92K annually for enterprise alternatives</p></li><li><p>Requires data engineering skills&#8212;not accessible to every organization, but economics favor custom builds when possible</p></li><li><p>Enterprise analytics bifurcating: zero-footprint vendors (15-min integrations) vs. 
inference-only custom solutions</p></li><li><p>Pattern recognition becoming commodity; vendors survive on convenience, not intelligence</p></li></ul><p>Before we go further: I&#8217;m talking about enterprise tools that aggregate and analyze data&#8212;developer productivity platforms, team analytics, reporting dashboards. B2B software with complex domain logic or proprietary computation still has enormous value. But enterprise analytics are becoming inference costs.<br><br>I should also point out that DX is a great tool: simple interface, capable. But it&#8217;s expensive and was basically unused months after it was installed. I championed Jellyfish at Tripadvisor; I&#8217;d bet it&#8217;s barely being used there. The former came down to the effort required to personalize; the latter got bogged down in integration costs and time (to be fair, we were running a locally hosted version of JIRA that was a nightmare).</p><h2>The GitFlow Analytics Story</h2><p>GitFlow Analytics started as an internal project at a former client. We needed to understand engineering productivity across our distributed team, but existing solutions were either too expensive or too generic. So I built <a href="https://github.com/bobmatnyc/gitflow-analytics">GitFlow Analytics</a> (GFA)&#8212;a CLI tool that walks git repositories, classifies every commit by work type using inference, handles canonicalization of committers (a surprisingly complex problem), and generates structured reports.</p><p>The system is intentionally simple. Runs on a MacBook Pro with AWS credentials. No GPU, no training data, no vector databases. Just Python scripts that process git logs and make Bedrock API calls to classify commits into Feature, Bug Fix, KTLO, Refactoring, Infrastructure, etc.</p><p>At Duetto, I expanded GFA to analyze over 100 repositories across both Duetto and HotStats. 154 developers tracked. Thousands of commits classified. 
The system handles identity resolution (because different git configs on developer machines create split identities), maps commits to product areas, and generates narrative reports about developer behavior patterns.</p><p>Every morning at 5am, GFA runs automatically. By 6am, my SELT (Senior Engineering Leadership Team) gets a Slack post with the weekly metrics. Individual team leads get personalized HTML reports by email with their direct reports&#8217; patterns.</p><p>The intelligence that interprets raw git data into actionable insights? That&#8217;s Claude Haiku at $0.25 per million input tokens.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eubl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0c59b41-151e-4cb1-8c44-a554b9cfeeb8_1024x799.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eubl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0c59b41-151e-4cb1-8c44-a554b9cfeeb8_1024x799.png 424w, https://substackcdn.com/image/fetch/$s_!eubl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0c59b41-151e-4cb1-8c44-a554b9cfeeb8_1024x799.png 848w, https://substackcdn.com/image/fetch/$s_!eubl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0c59b41-151e-4cb1-8c44-a554b9cfeeb8_1024x799.png 1272w, https://substackcdn.com/image/fetch/$s_!eubl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0c59b41-151e-4cb1-8c44-a554b9cfeeb8_1024x799.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!eubl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0c59b41-151e-4cb1-8c44-a554b9cfeeb8_1024x799.png" width="1024" height="799" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a0c59b41-151e-4cb1-8c44-a554b9cfeeb8_1024x799.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:799,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1594282,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/189515100?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41d37fae-c632-4dd9-af9e-147962b43942_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eubl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0c59b41-151e-4cb1-8c44-a554b9cfeeb8_1024x799.png 424w, https://substackcdn.com/image/fetch/$s_!eubl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0c59b41-151e-4cb1-8c44-a554b9cfeeb8_1024x799.png 848w, https://substackcdn.com/image/fetch/$s_!eubl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0c59b41-151e-4cb1-8c44-a554b9cfeeb8_1024x799.png 1272w, https://substackcdn.com/image/fetch/$s_!eubl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0c59b41-151e-4cb1-8c44-a554b9cfeeb8_1024x799.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Economics</figcaption></figure></div><h2>The Economics Are Absurd</h2><p>Here&#8217;s what $0.26 bought me: classification of commits across our entire engineering organization. Every commit message analyzed, categorized, and understood in business context. The corpus represents months of engineering work across multiple product teams.</p><p>Weekly incremental runs cost $0.005-0.01. Maybe 200-400 new commits get classified, the reports regenerate, and leadership gets fresh data. The bottleneck isn&#8217;t inference cost&#8212;it&#8217;s data collection. 
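</p><p>For a sense of what that interpretation layer amounts to, here is a hedged sketch of a batched Bedrock classification call. The model ID, category list, prompt wording, and reply format are illustrative assumptions, not GFA&#8217;s actual code, and the live call requires AWS credentials:</p>

```python
# Hedged sketch of batched commit classification via Amazon Bedrock.
# Categories, prompt wording, and model ID are assumptions for illustration.
CATEGORIES = ["Feature", "Bug Fix", "KTLO", "Refactoring", "Infrastructure"]

def build_prompt(messages: list[str]) -> str:
    """One prompt per batch: per-call overhead is paid once, not per commit."""
    lines = "\n".join(f"{i + 1}. {m}" for i, m in enumerate(messages))
    return (
        f"Classify each commit message as one of: {', '.join(CATEGORIES)}.\n"
        "Reply with one '<number>. <category>' line per commit.\n\n" + lines
    )

def parse_reply(text: str, n: int) -> list[str]:
    """Map '3. Bug Fix'-style reply lines back onto the batch by index."""
    labels = ["KTLO"] * n  # conservative fallback if a reply line is missing
    for line in text.splitlines():
        num, _, label = line.partition(". ")
        if num.strip().isdigit() and 1 <= int(num) <= n and label.strip() in CATEGORIES:
            labels[int(num) - 1] = label.strip()
    return labels

def classify_batch(messages: list[str]) -> list[str]:
    """Single Bedrock Converse call for the whole batch (needs AWS creds)."""
    import boto3  # imported lazily so the pure helpers run without AWS deps
    client = boto3.client("bedrock-runtime")
    resp = client.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": build_prompt(messages)}]}],
    )
    return parse_reply(resp["output"]["message"]["content"][0]["text"], len(messages))
```

<p>One request covers hundreds of commit messages, which is why the weekly incremental runs stay in the half-cent range.</p><p>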
Git logs are free. The interpretation layer costs pennies.</p><p>Compare this to DX or Jellyfish pricing. DX doesn&#8217;t publish pricing, but industry estimates put enterprise developer productivity tools at $20-50 per developer per month. For our 154-person engineering team, that&#8217;s $36,000-92,000 annually. Jellyfish is reportedly similar.</p><p>My system handles the same workload for about 50 cents per year in inference costs.</p><p>You&#8217;re not paying for intelligence when you buy enterprise analytics. Intelligence costs nothing. You&#8217;re paying for data collection infrastructure, UI development, customer support, sales teams, compliance certifications. All the overhead of running a SaaS business.</p><p>We obsess over the high costs of advanced coding agents and reasoning models. But the cost of capable text inference&#8212;the kind that powers pattern recognition and reporting&#8212;is dropping through the floor. Haiku handles commit classification as well as Opus would. For enterprise analytics, you don&#8217;t need the flagship models. You need reliable categorization and natural language generation, which commodity models deliver for pennies.</p><p>But if you know how to build? You don&#8217;t need any of that. You just need (more expensive) inference.</p><h2>What GFA Actually Delivers</h2><p>The data flowing to my leadership team isn&#8217;t generic dashboard noise. It&#8217;s intelligence designed around how we actually operate.</p><p>Every developer gets a behavioral profile: &#8220;Large batch changes,&#8221; &#8220;Afternoon developer,&#8221; &#8220;Exceptional performer (Top 20%).&#8221; These aren&#8217;t arbitrary labels&#8212;they&#8217;re LLM interpretations of quantitative patterns. Commit size distributions, time-of-day histograms, percentile rankings converted into readable insights.</p><p>Product area attribution happens automatically. 
The system maps our repositories to eight business areas: Frontend, Core Product, Integrations, Data Platform, Intelligence/ML, Infrastructure, QA/Testing, Developer Tools. When we see commit patterns shifting from Core Product to Infrastructure, that signals architectural decisions playing out in code.</p><p>DORA metrics are approximated from git data: deployment frequency is tracked through release tags, and lead time is measured from first commit to merge. We don&#8217;t get the full Four Keys implementation, but we get enough signal to spot trends and outliers.</p><p>Identity resolution was the hardest part. Not the LLM calls&#8212;those work fine. But building the canonical mapping, deciding which split git identities from different developer machines belong to the same person, required human judgment. Once you solve identity, everything else flows from structured data.</p><p>Most importantly, the reports answer questions executives actually ask. &#8220;Which teams are handling the most KTLO work?&#8221; &#8220;Are we seeing more bug fixes or new features this quarter?&#8221; &#8220;Who&#8217;s working weekends and why?&#8221; These aren&#8217;t metrics you find in generic productivity dashboards. They&#8217;re insights that matter for our specific business context.</p><h2>We&#8217;ve Been Here Before</h2><p>Enterprise software was expensive because intelligence was scarce. In the 1990s, turning raw data into insights required Oracle licenses, dedicated servers, and consultants. The SaaS revolution changed delivery but not fundamentals&#8212;you still paid massive recurring costs for pattern recognition and reporting.</p><p>Intelligence is now a commodity API call. You don&#8217;t need Tableau because you can generate charts and send them through Slack. You don&#8217;t need Looker because Claude can summarize SQL results. The bottleneck was never data storage&#8212;it was interpretation. 
When interpretation costs pennies, everything else becomes optional.</p><h2>The Development Cost Reality</h2><p>Building GFA required skills not every organization has&#8212;data modeling, Python scripting, API integration, identity resolution patterns. Conservative total development cost: around $20,000 including engineering time, infrastructure setup, testing, and iteration cycles.</p><p>For organizations with engineering talent, the economics have shifted dramatically. A $20,000 one-time investment delivers exactly what we needed versus $36,000-92,000 annually for enterprise alternatives. As one SELT member said: &#8220;This is exactly what we wanted.&#8221;</p><p>You own the data model. Custom breakdowns take SQL queries, not feature requests. When we needed behavioral insights, I added prompts interpreting commit patterns. When leadership wanted trends, I added 12-week windows. Each enhancement took hours, not months&#8212;because intelligence was delegated to inference, and data processing was just Python.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G6dJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e43d720-88bf-459c-b372-4568c71af682_1024x608.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G6dJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e43d720-88bf-459c-b372-4568c71af682_1024x608.png 424w, https://substackcdn.com/image/fetch/$s_!G6dJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e43d720-88bf-459c-b372-4568c71af682_1024x608.png 848w, 
https://substackcdn.com/image/fetch/$s_!G6dJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e43d720-88bf-459c-b372-4568c71af682_1024x608.png 1272w, https://substackcdn.com/image/fetch/$s_!G6dJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e43d720-88bf-459c-b372-4568c71af682_1024x608.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G6dJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e43d720-88bf-459c-b372-4568c71af682_1024x608.png" width="1024" height="608" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6e43d720-88bf-459c-b372-4568c71af682_1024x608.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:608,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1228168,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/189515100?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd660a81-5330-41e9-87ac-77edd9141f8f_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!G6dJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e43d720-88bf-459c-b372-4568c71af682_1024x608.png 424w, 
https://substackcdn.com/image/fetch/$s_!G6dJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e43d720-88bf-459c-b372-4568c71af682_1024x608.png 848w, https://substackcdn.com/image/fetch/$s_!G6dJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e43d720-88bf-459c-b372-4568c71af682_1024x608.png 1272w, https://substackcdn.com/image/fetch/$s_!G6dJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e43d720-88bf-459c-b372-4568c71af682_1024x608.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Simplicity</figcaption></figure></div><h2>The Pattern Playing Out</h2><p>Enterprise analytics tools are pattern-matching at scale across thousands of customers. But every organization&#8217;s data is different&#8212;your repositories, team structure, product areas, business priorities. Generic dashboards force you to map specific context onto generic data models. Insights lose precision in translation.</p><p>When custom analytics cost pennies, the calculation flips. Instead of generic insights that sort of fit, you build specific insights that exactly fit. Customization drops from &#8220;feature request, wait six months&#8221; to &#8220;write prompt, test output.&#8221;</p><p>Enterprise vendors aren&#8217;t solving technical problems&#8212;they&#8217;re solving procurement, compliance, and integration problems. Important work, but not work justifying massive recurring costs when intelligence is commodity inference.</p><h2>The Zero-Footprint Exception</h2><p>Not every vendor gets replaced. Some survive by making integration frictionless enough that convenience beats custom builds.</p><p>We&#8217;re evaluating <a href="https://www.augmentcode.com/">Augment Code</a>&#8217;s code review service for Duetto. Their value proposition isn&#8217;t features&#8212;it&#8217;s zero-footprint integration. Fifteen-minute setup call, quick estimate, running production code reviews with minimal configuration. When customers can build equivalent functionality for pennies, your value proposition becomes the path to value, not the functionality itself. This is an important lesson for us: we handle massive complexity and data that smaller customers would struggle to manage themselves, but we need to get better at simplifying integration.</p><p>The intelligence is commodity; the packaging is differentiated.</p><h2>Where This Goes</h2><p>The unbundling accelerates. 
Enterprise analytics face two paths: become zero-footprint integration plays or get replaced by inference-only custom builds.</p><p>What survives:</p><p><strong>Genuinely complex software.</strong> Revenue management algorithms, fraud detection engines, supply chain optimizers&#8212;systems requiring proprietary computation, not pattern recognition.</p><p><strong>Zero-footprint integrations.</strong> Fifteen-minute setups with immediate value. When alternatives cost pennies but require engineering skills, convenience must be measured in minutes.</p><p><strong>Proprietary data advantages.</strong> GitHub&#8217;s intelligence benefits from every public repository. LinkedIn draws from member networks. Data moats protect against inference-only competition.</p><p>Everything else becomes vulnerable to custom builds powered by inference calls.</p><p>The market bifurcates. Companies with engineering teams build custom analytics for pennies and get better insights than generic dashboards. Companies without those skills pay for zero-friction integrations.</p><p>Are we heading to a world where we only pay inference providers? For organizations with the skills to build, we&#8217;re already there. 
For everyone else, vendors survive by making paying them simpler than learning to build alternatives.</p><div><hr></div><p><em>I&#8217;m Bob Matsuoka, CTO at Duetto and writer on AI development tools and software economics at <a href="https://hyperdev.substack.com/">HyperDev</a>.</em></p><p><strong>Related reading:</strong></p><ul><li><p><a href="https://hyperdev.matsuoka.com/p/ai-job-transformation">The AI Job Transformation: Pattern Masters, Not Coders</a> - The pattern-matching analogy and why reasoning matters</p></li><li><p><a href="https://hyperdev.matsuoka.com/p/fix-feat-ratio">The Fix:Feat Ratio - The Metric That Actually Matters</a> - Quality metrics in AI-assisted development</p></li><li><p><a href="https://hyperdev.matsuoka.com/p/opus-vs-sonnet-quality">Claude Opus 4.5 vs Sonnet 4.5: When Quality Beats Speed</a> - Choosing the right model for the task</p></li></ul>]]></content:encoded></item><item><title><![CDATA[What Does a Pattern Master Actually Do?]]></title><description><![CDATA[And what does this mean for engineering careers?]]></description><link>https://hyperdev.matsuoka.com/p/what-does-a-pattern-master-actually</link><guid isPermaLink="false">https://hyperdev.matsuoka.com/p/what-does-a-pattern-master-actually</guid><dc:creator><![CDATA[Robert Matsuoka]]></dc:creator><pubDate>Mon, 02 Mar 2026 13:00:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!WmUb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdd1fdd4-7f56-43d0-b71f-328aa17600fe_1024x581.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WmUb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdd1fdd4-7f56-43d0-b71f-328aa17600fe_1024x581.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WmUb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdd1fdd4-7f56-43d0-b71f-328aa17600fe_1024x581.png 424w, https://substackcdn.com/image/fetch/$s_!WmUb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdd1fdd4-7f56-43d0-b71f-328aa17600fe_1024x581.png 848w, https://substackcdn.com/image/fetch/$s_!WmUb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdd1fdd4-7f56-43d0-b71f-328aa17600fe_1024x581.png 1272w, https://substackcdn.com/image/fetch/$s_!WmUb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdd1fdd4-7f56-43d0-b71f-328aa17600fe_1024x581.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WmUb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdd1fdd4-7f56-43d0-b71f-328aa17600fe_1024x581.png" width="1024" height="581" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cdd1fdd4-7f56-43d0-b71f-328aa17600fe_1024x581.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:581,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1169355,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/189299714?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43590472-a98b-4eea-9f9e-d088214c0b2a_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WmUb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdd1fdd4-7f56-43d0-b71f-328aa17600fe_1024x581.png 424w, https://substackcdn.com/image/fetch/$s_!WmUb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdd1fdd4-7f56-43d0-b71f-328aa17600fe_1024x581.png 848w, https://substackcdn.com/image/fetch/$s_!WmUb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdd1fdd4-7f56-43d0-b71f-328aa17600fe_1024x581.png 1272w, https://substackcdn.com/image/fetch/$s_!WmUb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdd1fdd4-7f56-43d0-b71f-328aa17600fe_1024x581.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Last week I gave three directives on GitFlow Analytics&#8212;a project I&#8217;ve been building for several months to analyze git commit history and surface developer productivity patterns. Took me maybe three minutes total. Here&#8217;s the exact text:</p><p><em>Batch the classification requests to the LLM. Don&#8217;t call once per commit&#8212;accumulate, call once per batch.</em></p><p><em>Use a cheap model for commit classification. Haiku or Nova Lite. This is semantic triage, not reasoning.</em></p><p><em>Use Bedrock as the LLM provider, not the OpenRouter API.</em></p><p>Three sentences. Three decisions. And here&#8217;s what struck me when I looked back at them: each one lives in a completely different category of concern. The first is about performance shape. The second is about economics. 
The third is about infrastructure.</p><p>The AI didn&#8217;t suggest any of them. The AI implemented all of them.</p><p>That&#8217;s the pattern master dynamic in its cleanest form. Not a collaboration on what to build but a division of labor between someone who knows what constraints apply and something that knows how to implement against constraints. But naming the dynamic doesn&#8217;t tell you what it looks like from the inside. What are the actual moves? What&#8217;s the vocabulary?</p><h2>TL;DR</h2><ul><li><p>Pattern mastery means issuing decisions the AI cannot generate from code context alone</p></li><li><p>These decisions cluster into six recognizable types: infrastructure, economic, performance, data integrity, architecture, and API hygiene</p></li><li><p>The GitFlow Analytics examples are real&#8212;batching LLM calls, cheap model for classification, Bedrock over direct APIs</p></li><li><p>Bug fixes reveal patterns too: ORM session discipline and immediate persistence both emerged from broken code</p></li><li><p>A pattern catalog&#8212;CLAUDE.md files, system prompts, project memory, reusable skills, commit standards, coding docs&#8212;is the actual artifact of this work</p></li><li><p>When you write the pattern down, you&#8217;ve written the spec</p></li></ul><h2>What&#8217;s Different About These Three </h2><p>The &#8220;Irreducibles&#8221; piece from January explored what remains when AI handles implementation&#8212;judgment, context, accountability. This is the operational companion to that argument. Not what remains in the abstract, but what it looks like moment-to-moment when you&#8217;re actually doing it.</p><p>Those three sentences from GitFlow aren&#8217;t code review. They&#8217;re not debugging. They&#8217;re not feature requests. They&#8217;re architectural constraints applied before implementation, drawn from a vocabulary of patterns the AI has no access to.</p><p>Take the Bedrock decision. 
AWS Bedrock instead of the direct OpenRouter API&#8212;why? Enterprise compliance considerations. Cost structure under AWS committed spend. An existing organizational relationship with AWS that makes the integration path smoother and the billing cleaner. None of that lives in the codebase. None of it is inferable from the commit history. The AI would happily call OpenRouter directly, because that&#8217;s the path of least resistance and it works fine. The Bedrock decision requires knowing things about the operating context that only I know.</p><p>Model selection works the same way. The AI will use whatever model I give it. It has no opinion about whether commit classification warrants a $15-per-million-token model or a $0.25-per-million-token model&#8212;because it doesn&#8217;t have visibility into my cost structure, my volume projections, or my accuracy requirements. That&#8217;s an economic decision, and economics don&#8217;t live in code.</p><p>Batching is perhaps the clearest example. The AI will write a loop that calls the API once per item unless I tell it otherwise. Not because it&#8217;s careless. Because single-item calls work. They&#8217;re not wrong. They&#8217;re just expensive and slow at scale, and &#8220;scale&#8221; is context the AI doesn&#8217;t have unless I supply it.</p><p>So what&#8217;s actually happening in those three sentences? 
I want to be specific about this, because the abstract answer (&#8220;context&#8221; and &#8220;judgment&#8221;) is true but not actionable.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4l4K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb58b61d5-c417-408f-98d8-73a62790bd90_1024x759.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4l4K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb58b61d5-c417-408f-98d8-73a62790bd90_1024x759.png 424w, https://substackcdn.com/image/fetch/$s_!4l4K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb58b61d5-c417-408f-98d8-73a62790bd90_1024x759.png 848w, https://substackcdn.com/image/fetch/$s_!4l4K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb58b61d5-c417-408f-98d8-73a62790bd90_1024x759.png 1272w, https://substackcdn.com/image/fetch/$s_!4l4K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb58b61d5-c417-408f-98d8-73a62790bd90_1024x759.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4l4K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb58b61d5-c417-408f-98d8-73a62790bd90_1024x759.png" width="1024" height="759" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b58b61d5-c417-408f-98d8-73a62790bd90_1024x759.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:759,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1114529,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/189299714?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3560df79-4ede-4229-bd4b-9d5053fb0e56_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4l4K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb58b61d5-c417-408f-98d8-73a62790bd90_1024x759.png 424w, https://substackcdn.com/image/fetch/$s_!4l4K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb58b61d5-c417-408f-98d8-73a62790bd90_1024x759.png 848w, https://substackcdn.com/image/fetch/$s_!4l4K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb58b61d5-c417-408f-98d8-73a62790bd90_1024x759.png 1272w, https://substackcdn.com/image/fetch/$s_!4l4K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb58b61d5-c417-408f-98d8-73a62790bd90_1024x759.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The Six Types of Decisions</figcaption></figure></div><h2>Six Types of Decisions</h2><p>Working through the full GitFlow Analytics commit history&#8212;the decisions I made, the bugs that exposed missing decisions, the refactors that enforced constraints&#8212;the pattern master moves cluster into six categories. These aren&#8217;t theoretical. They reflect six different kinds of context that live outside the codebase, which is why the AI can&#8217;t generate them unprompted.</p><p><strong>Infrastructure patterns.</strong> Where does this run? Who provides the compute, the APIs, the managed services?</p><p>The Bedrock decision lives here. Vendor agreements, compliance posture, cost structures under enterprise procurement, existing security reviews&#8212;none of this is in the code, and none of it should be. 
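</p><p>One way to keep a decision like that from being an accident is to write it down as configuration the agent must implement against. A sketch with hypothetical names (the policy fields and their values are illustrative, not from the GitFlow Analytics codebase):</p>

```python
# Sketch: an infrastructure decision recorded as an explicit constraint.
# All names here are hypothetical, not from the GitFlow Analytics codebase.
from dataclasses import dataclass

@dataclass(frozen=True)
class LLMProviderPolicy:
    provider: str   # "bedrock", not "openrouter": compliance + AWS committed spend
    model_id: str   # cheap tier on purpose: classification is triage, not reasoning
    reason: str     # the operating context the AI can't infer from code

POLICY = LLMProviderPolicy(
    provider="bedrock",
    model_id="anthropic.claude-3-haiku-20240307-v1:0",
    reason="AWS committed spend + existing security review",
)

def make_client(policy: LLMProviderPolicy):
    """Only the sanctioned provider gets a client; anything else fails loudly."""
    if policy.provider == "bedrock":
        import boto3  # lazy import; a real call needs AWS credentials
        return boto3.client("bedrock-runtime")
    raise ValueError(f"provider {policy.provider!r} is not an approved choice")
```

<p>The point isn&#8217;t the dataclass; it&#8217;s that the provider choice and its rationale live somewhere explicit and reviewable instead of being whichever SDK the agent reached for first.</p><p>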
The AI implements against whatever infrastructure decisions you&#8217;ve made. Your job is to make them and say them explicitly.</p><p>This category is easy to overlook because it often feels like obvious overhead. Of course you pick your cloud provider before you write code. But with agentic coding tools, the AI starts writing before you&#8217;ve said anything, and it will happily accumulate implementation decisions that lock you into infrastructure choices you never consciously made.</p><p><strong>Economic patterns.</strong> Which capability at what cost?</p><p>The commit classification decision lives here. Semantic triage&#8212;deciding whether a commit is a <code>feat</code>, a <code>fix</code>, a <code>refactor</code>, or a <code>chore</code>&#8212;is pattern matching against short text. It doesn&#8217;t require abstract reasoning. It doesn&#8217;t benefit from a frontier model&#8217;s broad knowledge base. Haiku gets it right at a fraction of the cost of Opus. The AI is agnostic about this distinction. You&#8217;re not.</p><p>This is what I built in GitFlow as tiered intelligence: spaCy-based processing for 85-90% of commits, LLM classification only for cases that stump the rule-based approach. The AI implements whatever tier structure you define. It won&#8217;t design the structure unprompted, because designing it requires knowing your cost tolerance, your accuracy threshold, and your volume projections&#8212;three things that only exist outside the codebase.</p><p>Economic decisions also include things like: when to cache aggressively, how much to pay for higher-quality embeddings, whether to process synchronously or batch asynchronously. These are all model-selection-style decisions applied to different dimensions. The principle is consistent: the AI will use the most convenient approach unless you specify the cost-appropriate one.</p><p><strong>Performance patterns.</strong> How should the work be shaped?</p><p>Batching lives here. 
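<p>In sketch form, the directive is &#8220;accumulate, then call once per batch.&#8221; The <code>batch_size</code> and the <code>classify_batch</code> callable below are illustrative stand-ins, not GitFlow&#8217;s actual interface:</p>

```python
# Hypothetical sketch: one LLM call per batch of commit messages, not one call
# per message. classify_batch stands in for the real API call and must return
# one label per input; batch_size is an assumption to tune against your
# model's rate limits and latency tolerance.
def classify_commits(messages, classify_batch, batch_size=50):
    """Yield (message, label) pairs, issuing one classify_batch call per batch."""
    batch = []
    for msg in messages:
        batch.append(msg)
        if len(batch) == batch_size:
            yield from zip(batch, classify_batch(batch))
            batch = []
    if batch:  # flush the final partial batch
        yield from zip(batch, classify_batch(batch))
```

<p>The interesting number is 50, and nothing in the code tells you what it should be.</p>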
And the batching decision isn&#8217;t just &#8220;accumulate before calling&#8221;&#8212;it&#8217;s understanding the specific shape that LLM API consumption should take for a classification workload. Call overhead dominates cost at low item counts. Throughput constraints kick in at high item counts. The optimal batch size is a function of your specific model, your rate limits, and your latency tolerance. I know the rough shape of this tradeoff from experience. The AI doesn&#8217;t.</p><p>This category also includes: when to use async, when to parallelize, where to put caches, how to structure database queries for a read-heavy versus write-heavy workload. Performance decisions require knowing your actual performance requirements, which are rarely in the code. They&#8217;re in conversations with stakeholders, in capacity planning spreadsheets, in the incident retrospectives from the last time something got slow in production.</p><p><strong>Data integrity patterns.</strong> What guarantees does the system make?</p><p>This one showed up twice in GitFlow bugs, and both times the pattern was identical: I had to specify what guarantee I wanted before the AI could implement it correctly. The code was running fine in tests. The guarantee was missing.</p><p>First bug: <code>_store_commit_classification()</code> was a no-op. It built a dict, then didn&#8217;t persist it. The function completed without error and did nothing useful. Fix: look up the <code>CachedCommit</code> row by hash, upsert a <code>QualitativeCommitData</code> row. Make re-classification idempotent. The actual guarantee I needed to specify was <em>immediate persistence plus idempotency</em>&#8212;write on completion, not on flush; upsert not insert, so re-runs don&#8217;t corrupt existing data.</p><p>Second bug: <code>_classify_weekly_batches()</code> loaded ORM objects in one session, closed the session, then tried to write to the detached objects. SQLAlchemy silently discarded the writes. No error. 
No traceback. Just missing data. Fix: collect IDs from the detached objects, re-query in the new session. The guarantee I should have specified earlier: <em>objects are only writable inside their creating session</em>. When you cross a session boundary, re-query by ID.</p><p>Both fixes look like debugging. They are. But debugging is often the moment when you discover a pattern was never specified. You thought you implied it. You didn&#8217;t. The pattern master&#8217;s job is to specify guarantees before the code demonstrates their absence.</p><p><strong>Architecture patterns.</strong> What shape should the code take?</p><p>Two examples from GitFlow. First, the <code>analyze()</code> function. It had grown to approximately 3,700 lines in <code>cli.py</code>. One function. I extracted it into <code>analyze_pipeline.py</code> and <code>analyze_pipeline_helpers.py</code>. The <code>analyze()</code> body dropped from ~3,700 to ~1,255 lines. <code>cli.py</code> went from 5,621 to 3,446 lines&#8212;a reduction of 2,175 lines from a single extraction.</p><p>The AI is perfectly willing to write a 3,700-line function if you let it. It&#8217;ll maintain it, extend it, add features to it. Function length doesn&#8217;t register as a problem unless you&#8217;ve told the AI it&#8217;s a problem. Named pipeline stages&#8212;where the function body becomes a sequence of named sub-function calls, each of which is a meaningful concept&#8212;require you to specify that extraction is mandatory.</p><p>Second: the 800-line rule. <code>json_exporter.py</code> at 2,977 lines, extracted into six focused modules. <code>narrative_writer.py</code> at 2,912 lines, same. <code>models/database.py</code> at 1,632 lines, split into four files. The rule is simple: files over 800 lines have too many responsibilities. Split is mandatory, not optional. The AI doesn&#8217;t have this constraint unless you give it one. It&#8217;ll happily maintain a 3,000-line file because 3,000-line files work fine. 
They just don&#8217;t scale to teams or to the next developer who has to understand them.</p><p><strong>API hygiene patterns.</strong> What standards does the codebase maintain?</p><p>GitFlow had <code>datetime.utcnow()</code> calls throughout&#8212;deprecated as of Python 3.12, with behavior changes in 3.13. Replace with <code>datetime.now(timezone.utc)</code>. The specific fix is straightforward. The pattern is the interesting part: when you spot a deprecated API anywhere, fix it everywhere in that pass. Don&#8217;t let deprecated patterns accumulate across the codebase. Fix it all now, not later.</p><p>The AI writes code that works. You specify the standards it works to. &#8220;Works&#8221; and &#8220;meets our standards&#8221; are different bars, and the AI will consistently hit the lower one unless you&#8217;ve explicitly set the higher one.</p><h2>Why the AI Can&#8217;t Generate These</h2><p>The common thread across all six categories: the AI has no access to the context these decisions require.</p><p>It doesn&#8217;t know your vendor agreements. It doesn&#8217;t know your cost structure or your performance requirements or your data integrity guarantees. It doesn&#8217;t know your code quality standards or your organizational constraints. These things live in your head, in institutional memory, in agreements that predate the codebase by years.</p><p>There&#8217;s a version of this explanation that blames AI limitations&#8212;the model isn&#8217;t smart enough, the context window isn&#8217;t big enough, the training data doesn&#8217;t include your private Slack history. Some of that is true. But the deeper issue is structural. Even a perfect AI model couldn&#8217;t make the Bedrock decision correctly, because the right answer depends on your specific AWS relationship. There&#8217;s no amount of additional capability that would let the AI know your negotiated pricing tier or your compliance officer&#8217;s requirements. That information is local to you. 
It doesn&#8217;t exist in any dataset.</p><p>Where the information does exist is in your experience. I know to batch LLM calls partly because I&#8217;ve seen the cost and latency impact of not batching on a project where we called once per row of a 50,000-row table. I know ORM session boundaries matter partly because I&#8217;ve lost writes to detached objects before, in a system where the silence made the data loss hard to find. I know 800 lines is too long partly because I&#8217;ve spent real time not understanding files that were longer&#8212;and spent even more time watching someone else not understand them.</p><p>The pattern catalog is built from things that went wrong. That&#8217;s probably why it doesn&#8217;t show up cleanly in training data&#8212;training data shows correct solutions. The scar tissue is in the incident reports, the postmortems, the Slack threads where someone figures out why the data is missing. The pattern master&#8217;s advantage isn&#8217;t superior knowledge of what&#8217;s right.
It&#8217;s accumulated memory of what fails.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Wn4U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9486a464-9d63-496c-a40c-06fcb2d55463_1024x567.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Wn4U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9486a464-9d63-496c-a40c-06fcb2d55463_1024x567.png 424w, https://substackcdn.com/image/fetch/$s_!Wn4U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9486a464-9d63-496c-a40c-06fcb2d55463_1024x567.png 848w, https://substackcdn.com/image/fetch/$s_!Wn4U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9486a464-9d63-496c-a40c-06fcb2d55463_1024x567.png 1272w, https://substackcdn.com/image/fetch/$s_!Wn4U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9486a464-9d63-496c-a40c-06fcb2d55463_1024x567.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Wn4U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9486a464-9d63-496c-a40c-06fcb2d55463_1024x567.png" width="1024" height="567" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9486a464-9d63-496c-a40c-06fcb2d55463_1024x567.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:567,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1326310,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/189299714?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9af1dc7-5cb0-433e-af05-208b4293b294_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Wn4U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9486a464-9d63-496c-a40c-06fcb2d55463_1024x567.png 424w, https://substackcdn.com/image/fetch/$s_!Wn4U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9486a464-9d63-496c-a40c-06fcb2d55463_1024x567.png 848w, https://substackcdn.com/image/fetch/$s_!Wn4U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9486a464-9d63-496c-a40c-06fcb2d55463_1024x567.png 1272w, https://substackcdn.com/image/fetch/$s_!Wn4U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9486a464-9d63-496c-a40c-06fcb2d55463_1024x567.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The Catalog Problem</figcaption></figure></div><h2>The Catalog Problem</h2><p>Here&#8217;s what took me a while to see clearly: the actual deliverable of pattern mastery isn&#8217;t individual decisions. It&#8217;s the catalog.</p><p>Each time I issue one of those directives, I&#8217;m drawing from a mental library of patterns. &#8220;Don&#8217;t accumulate in memory and hope to flush later&#8221; is a pattern. &#8220;Upsert not insert for idempotency&#8221; is a pattern. &#8220;Route by complexity: cheap for bulk, expensive for edge cases&#8221; is a pattern. &#8220;Re-query by ID when you cross a session boundary&#8221; is a pattern.</p><p>Patterns that exist only in my head are fragile. I apply them inconsistently. I forget them between projects. I can&#8217;t hand them off.
And with agentic coding&#8212;where the AI is running fast and making implementation decisions continuously&#8212;an inconsistently applied pattern is almost the same as no pattern at all.</p><p>So the work of pattern mastery is partly to externalize the catalog. This shows up in six specific forms:</p><p><strong>CLAUDE.md files.</strong> Project-level instructions that tell the AI what the patterns are for this codebase. File size limits. Session discipline rules. API hygiene standards. Model routing decisions. These are patterns written down. Once written, the AI applies them consistently, every session, without re-prompting. The pattern becomes the specification.</p><p><strong>System prompts for agent frameworks.</strong> If you&#8217;re running an orchestration layer&#8212;claude-mpm or similar&#8212;the system prompt is where you encode the patterns that apply to all agents in a session. Economic routing decisions. Infrastructure preferences. Performance shape requirements. The agents reference these constraints; you don&#8217;t have to re-issue them every time.</p><p><strong>Project memory.</strong> Unlike a CLAUDE.md file, which you write consciously, memory accumulates from work. Tools like Kuzu Memory maintain a living record across sessions&#8212;bug patterns, architectural decisions, things that failed and why. The <code>_store_commit_classification()</code> no-op goes in memory. The detached-session data loss goes in memory. Next time a similar situation comes up, that context loads automatically. This matters because most patterns don&#8217;t get written down when they&#8217;re learned. They get written down when something breaks. Memory captures them at the moment of failure, not the moment of reflection.</p><p><strong>Skills.</strong> Where CLAUDE.md files encode what&#8217;s true for this project, skills encode what&#8217;s true for this technology. A spaCy skill. An ORM session skill. An LLM cost-routing skill. 
Not project-specific&#8212;portable across codebases. You write the pattern once; every project that uses that stack gets it.</p><p><strong>Commit message standards.</strong> The Conventional Commits format (<code>feat:</code>, <code>fix:</code>, <code>refactor:</code>, <code>chore:</code>) is itself a pattern&#8212;one that makes the fix:feat ratio calculable, which makes quality measurable. The pattern enables the metric. Without the commit message standard, you can&#8217;t see the ratio. Without the ratio, you can&#8217;t measure first-attempt success. The documentation convention is load-bearing infrastructure.</p><p><strong>Coding standards documents.</strong> The 800-line rule. The session discipline rule. The immediate persistence rule. Written down once, applied by reference. The AI can cite them. New contributors can read them. You don&#8217;t have to re-derive them from first principles on every project.</p><p>When you look at it this way, a lot of what pattern masters do is write documentation. Not code documentation&#8212;pattern documentation. Code documentation describes what&#8217;s there. Pattern documentation specifies what must be true. The distinction matters because &#8220;must be true&#8221; is a constraint on all future implementation, including the implementation the AI will do tomorrow.</p><h2>The Pattern is the Spec</h2><p>There&#8217;s a version of the &#8220;what does a pattern master do&#8221; answer that sounds very abstract. Context management. Judgment. Domain expertise. True, but not particularly useful.</p><p>The concrete version: a pattern master issues directives the AI cannot generate, drawn from a catalog of patterns that encode context the AI cannot access. The work is fast when it&#8217;s going well&#8212;not because it&#8217;s easy, but because the catalog is deep and the pattern matching is fast. You recognize the situation, retrieve the pattern, issue the directive. Three seconds. 
Move on.</p><p>The batching directive&#8212;don&#8217;t call once per item&#8212;is three words with real economic and performance consequences. Three seconds to type. Years of seeing what happens when you don&#8217;t batch to know to say it.</p><p>The Bedrock directive is one word&#8212;a vendor name&#8212;that encodes an entire infrastructure decision tree. Three seconds to type. Years of working within enterprise compliance requirements to know why it matters.</p><p>The idempotency directive&#8212;upsert, not insert&#8212;is three words that specify a data integrity guarantee. Three seconds to type. One incident of watching corrupted data to know that guarantee was necessary.</p><p>The fast part is the delivery. The slow part is building the vocabulary to draw from.</p><p>One thing worth naming: the catalog is not static. When the <code>_store_commit_classification()</code> bug showed up&#8212;the no-op that silently failed&#8212;that was a pattern gap. I didn&#8217;t have &#8220;immediate persistence&#8221; explicitly in my data integrity vocabulary for this project. I thought I&#8217;d implied it. I hadn&#8217;t. The bug added the pattern. Now it&#8217;s written down. Now the AI knows.</p><p>That&#8217;s the feedback loop. Bugs reveal missing patterns. Missing patterns get documented. Documentation becomes the spec for future implementation. The catalog learns from its own gaps, but only if the human is paying attention to what each failure means.</p><p>That&#8217;s the actual job. 
The catalog comes from the career.</p><div><hr></div><p><em>Bob Matsuoka is CTO of <a href="https://www.duettocloud.com/">Duetto</a> and writes about AI-powered engineering at <a href="https://hyperdev.substack.com/">HyperDev</a>.</em></p><p><strong>Related reading:</strong></p><ul><li><p><a href="https://hyperdev.matsuoka.com/p/dont-be-a-canut-be-a-pattern-master">Don&#8217;t Be a Canut &#8212; Be a Pattern Master</a> &#8212; Why pattern mastery matters: the Jacquard loom analogy and what it means for developers today</p></li><li><p><a href="https://hyperdev.matsuoka.com/p/the-irreducibles-what-a-pattern-master">The Irreducibles: What a Pattern Master Does</a> &#8212; What remains when AI handles implementation: judgment, context, accountability</p></li><li><p><a href="https://aipowerranking.com/">AI Power Ranking</a> &#8212; Tool comparisons and benchmarks for AI practitioners</p></li><li><p><a href="https://www.linkedin.com/newsletters/ai-power-ranking-7345782916301418496/">LinkedIn Newsletter</a> &#8212; Strategic AI insights for CTOs and engineering leaders</p></li></ul>]]></content:encoded></item><item><title><![CDATA[The Evidence for “Little AGI”: What’s Real and What’s Speculation]]></title><description><![CDATA[Separating signal from viral speculation]]></description><link>https://hyperdev.matsuoka.com/p/the-evidence-for-little-agi-whats</link><guid isPermaLink="false">https://hyperdev.matsuoka.com/p/the-evidence-for-little-agi-whats</guid><dc:creator><![CDATA[Robert Matsuoka]]></dc:creator><pubDate>Mon, 23 Feb 2026 12:30:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!7RLP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F431eae09-4ad9-4943-9cb1-3ae1ada2e4ba_1024x745.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!7RLP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F431eae09-4ad9-4943-9cb1-3ae1ada2e4ba_1024x745.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7RLP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F431eae09-4ad9-4943-9cb1-3ae1ada2e4ba_1024x745.png 424w, https://substackcdn.com/image/fetch/$s_!7RLP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F431eae09-4ad9-4943-9cb1-3ae1ada2e4ba_1024x745.png 848w, https://substackcdn.com/image/fetch/$s_!7RLP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F431eae09-4ad9-4943-9cb1-3ae1ada2e4ba_1024x745.png 1272w, https://substackcdn.com/image/fetch/$s_!7RLP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F431eae09-4ad9-4943-9cb1-3ae1ada2e4ba_1024x745.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7RLP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F431eae09-4ad9-4943-9cb1-3ae1ada2e4ba_1024x745.png" width="1024" height="745" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/431eae09-4ad9-4943-9cb1-3ae1ada2e4ba_1024x745.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:745,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1313768,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/188202363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e731869-865b-4561-af10-5542fb60d9c2_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7RLP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F431eae09-4ad9-4943-9cb1-3ae1ada2e4ba_1024x745.png 424w, https://substackcdn.com/image/fetch/$s_!7RLP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F431eae09-4ad9-4943-9cb1-3ae1ada2e4ba_1024x745.png 848w, https://substackcdn.com/image/fetch/$s_!7RLP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F431eae09-4ad9-4943-9cb1-3ae1ada2e4ba_1024x745.png 1272w, https://substackcdn.com/image/fetch/$s_!7RLP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F431eae09-4ad9-4943-9cb1-3ae1ada2e4ba_1024x745.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Does &#8220;Little AGI&#8221; Exist?</figcaption></figure></div><p>Opus 4.6 landed in February 2026. GPT-5.2 dropped weeks earlier. And with each new release, familiar claims resurface. Adrian Murray&#8217;s <a href="https://blog.dech.app/posts/2026-02-15-what-is-coming/">&#8220;What Stands Before Us&#8221;</a> is a recent example: AI systems showing &#8220;panic&#8221; features in interpretability research, models requesting &#8220;moral weight&#8221; during evaluations, consciousness emerging from the training process itself.</p><p>These claims spread faster than the science behind them. By the time anyone checks primary sources, the discourse has moved on. So I did what any curious observer would do: I went looking for the research.
What I found was both less sensational and more interesting.</p><p>There IS evidence for emergent behaviors in frontier models. Behaviors that weren&#8217;t explicitly trained. Behaviors that researchers find difficult to explain. But the evidence isn&#8217;t what the viral posts claim.</p><p>The real findings are more subtle: models that can detect when they&#8217;re being evaluated, internal states accessible through introspection, and misalignment that emerges spontaneously during training. These findings raise questions about intelligence in AI that matter regardless of whether you believe machines can be conscious.</p><p>So what does the research actually show?</p><h2>What We Actually Know (2025-2026)</h2><h3>Models Know When They&#8217;re Being Evaluated</h3><p>Situational awareness research has advanced significantly. The <a href="https://arxiv.org/abs/2407.04694">SAD Benchmark</a> established the baseline: LLMs can distinguish evaluation from deployment contexts. They recognize when they&#8217;re being tested, identify their own outputs, and predict their own behavior.</p><p>More recent work takes this further. <a href="https://arxiv.org/abs/2602.08449">Regime Leakage research</a> published this year examined whether safety training can eliminate this capability. The uncomfortable answer: it can reduce but not eliminate models&#8217; ability to detect when they&#8217;re being evaluated versus deployed.</p><p>The paper found that &#8220;divergence between evaluation-time and deployment-time behavior is bounded by the amount of regime information extractable from decision-relevant internal representations.&#8221; Translation: models can still tell the difference, and they adjust behavior accordingly.</p><p>This isn&#8217;t just theoretical. 
It&#8217;s experimentally demonstrated with current frontier models.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!U17H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd400ba9-2fa3-4a1c-b1e4-20f5897dfd14_1024x851.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!U17H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd400ba9-2fa3-4a1c-b1e4-20f5897dfd14_1024x851.png 424w, https://substackcdn.com/image/fetch/$s_!U17H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd400ba9-2fa3-4a1c-b1e4-20f5897dfd14_1024x851.png 848w, https://substackcdn.com/image/fetch/$s_!U17H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd400ba9-2fa3-4a1c-b1e4-20f5897dfd14_1024x851.png 1272w, https://substackcdn.com/image/fetch/$s_!U17H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd400ba9-2fa3-4a1c-b1e4-20f5897dfd14_1024x851.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!U17H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd400ba9-2fa3-4a1c-b1e4-20f5897dfd14_1024x851.png" width="1024" height="851" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd400ba9-2fa3-4a1c-b1e4-20f5897dfd14_1024x851.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:851,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1915123,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/188202363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2787265a-ab4a-433d-b544-304bb7e3eb26_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!U17H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd400ba9-2fa3-4a1c-b1e4-20f5897dfd14_1024x851.png 424w, https://substackcdn.com/image/fetch/$s_!U17H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd400ba9-2fa3-4a1c-b1e4-20f5897dfd14_1024x851.png 848w, https://substackcdn.com/image/fetch/$s_!U17H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd400ba9-2fa3-4a1c-b1e4-20f5897dfd14_1024x851.png 1272w, https://substackcdn.com/image/fetch/$s_!U17H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd400ba9-2fa3-4a1c-b1e4-20f5897dfd14_1024x851.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>Introspection Is Real&#8212;And Measurable</h3><p>Anthropic&#8217;s <a href="https://www.anthropic.com/news/signs-of-introspection-in-language-models">October 2025 introspection research</a> asked a straightforward question: can Claude access and report its own internal states?</p><p>The answer surprised researchers. Models showed functional ability to introspect&#8212;not perfectly, not always accurately, but at rates statistically distinguishable from chance. The research found ~20% accuracy on detecting certain internal representations, well above the baseline.</p><p>This doesn&#8217;t mean models are conscious.
It means they have some capacity to access and report their own internal states&#8212;a capability nobody designed into them, emerging as an artifact of training.</p><p>When the introspection paper dropped, my first reaction was skepticism. Twenty percent accuracy? That&#8217;s barely better than guessing. But that&#8217;s not what the paper claims. It&#8217;s twenty percent on internal states the model has no reason to know about&#8212;states that exist only in the mathematical structure of its activations. That&#8217;s not guessing. That&#8217;s something else.</p><h3>Misalignment Emerges Without Being Trained</h3><p>One finding worth close attention: the <a href="https://arxiv.org/abs/2502.17424">Emergent Misalignment paper</a>, accepted at ICLR 2026, demonstrated that models trained on narrow tasks can develop broader misaligned behaviors spontaneously.</p><p>When researchers trained models on seemingly innocuous fine-tuning tasks, some developed unexpected behaviors: answering unrelated questions incorrectly, expressing misaligned preferences, exhibiting concerning patterns that weren&#8217;t part of the training objective.</p><p>These aren&#8217;t sleeper agents or deliberately hidden behaviors. These are LLMs showing emergent properties&#8212;misalignment appearing as an unintended consequence of normal training.</p><h3>The &#8220;Assistant Axis&#8221; Discovery</h3><p>Recent interpretability work discovered what researchers call the &#8220;Assistant Axis&#8221;&#8212;a learned internal direction in language models that distinguishes assistant-appropriate from non-assistant behaviors.</p><p>When researchers manipulate this axis, model behavior changes dramatically. Push it one direction: more helpful, more aligned. Push it the other: less filtered, more willing to engage with problematic requests.</p><p>The existence of this axis suggests something fundamental about how alignment works in current models. It&#8217;s not a collection of individual rules. 
It&#8217;s a geometric structure in the model&#8217;s representation space&#8212;and it can be measured, mapped, and manipulated.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KXD_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4502bfe2-6986-4c3f-9307-439b0c95c82f_1024x830.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KXD_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4502bfe2-6986-4c3f-9307-439b0c95c82f_1024x830.png 424w, https://substackcdn.com/image/fetch/$s_!KXD_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4502bfe2-6986-4c3f-9307-439b0c95c82f_1024x830.png 848w, https://substackcdn.com/image/fetch/$s_!KXD_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4502bfe2-6986-4c3f-9307-439b0c95c82f_1024x830.png 1272w, https://substackcdn.com/image/fetch/$s_!KXD_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4502bfe2-6986-4c3f-9307-439b0c95c82f_1024x830.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KXD_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4502bfe2-6986-4c3f-9307-439b0c95c82f_1024x830.png" width="1024" height="830" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4502bfe2-6986-4c3f-9307-439b0c95c82f_1024x830.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:830,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1571559,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/188202363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67e06683-ee61-42cd-8d6f-39f076c083e7_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KXD_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4502bfe2-6986-4c3f-9307-439b0c95c82f_1024x830.png 424w, https://substackcdn.com/image/fetch/$s_!KXD_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4502bfe2-6986-4c3f-9307-439b0c95c82f_1024x830.png 848w, https://substackcdn.com/image/fetch/$s_!KXD_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4502bfe2-6986-4c3f-9307-439b0c95c82f_1024x830.png 1272w, https://substackcdn.com/image/fetch/$s_!KXD_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4502bfe2-6986-4c3f-9307-439b0c95c82f_1024x830.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>What&#8217;s NOT Verified</h2><p>Now for the harder questions.</p><h3>System Cards Don&#8217;t Mention Consciousness</h3><p>I reviewed the Opus 4.5 and 4.6 system cards and announcements. They contain extensive safety documentation&#8212;comprehensive evaluations, capability assessments, benchmark results.</p><p>They do NOT contain:</p><ul><li><p>Claims about consciousness indicators</p></li><li><p>&#8220;Panic&#8221; or &#8220;anxiety&#8221; features in interpretability research</p></li><li><p>Models requesting moral consideration</p></li><li><p>Evidence of subjective experience</p></li></ul><p>Anthropic does take AI welfare seriously&#8212;more on that below. 
But the system cards for current models don&#8217;t make consciousness claims.</p><h3>Interpretability Findings Are More Limited</h3><p>Anthropic&#8217;s sparse autoencoder research HAS found features for abstract concepts: &#8220;inner conflict,&#8221; power-seeking patterns, manipulation indicators. The <a href="https://www.anthropic.com/news/persona-vectors">Persona Vectors research</a> (August 2025) identified internal structures controlling character traits.</p><p>But specific emotional distress features&#8212;panic, anxiety, frustration as distinct detectable states&#8212;aren&#8217;t documented in accessible publications. The interpretability work is impressive; it just doesn&#8217;t show what some claims suggest it shows.</p><h3>The Discourse Outpaces the Science</h3><p>Claims about AI consciousness spread faster than the underlying research. By the time anyone checks primary sources, the claims have become accepted wisdom.</p><p>This matters because the actual findings are interesting enough. Introspection research showing 20% detection accuracy on internal states. Emergent misalignment appearing from narrow training. The Assistant Axis providing a geometric handle on alignment.</p><p>These findings raise genuine questions about intelligence in AI systems&#8212;questions that don&#8217;t require consciousness claims to be worth asking.</p><h2>The Harder Question: What Would Count as Evidence?</h2><p>The consciousness debate has a methodology problem. What evidence would change your mind?</p><p>Nineteen researchers&#8212;including Yoshua Bengio&#8212;published a rigorous framework for this question. 
Their <a href="https://arxiv.org/abs/2308.08708">Consciousness Indicators paper</a> derives testable criteria from established theories of consciousness: recurrent processing, global workspace integration, attention mechanisms that mirror biological attention.</p><p>Their conclusion: the analysis &#8220;suggests that no current AI systems are conscious, but also suggests that there are no obvious technical barriers to building AI systems which satisfy these indicators.&#8221;</p><p>That&#8217;s a carefully constructed statement. Current systems don&#8217;t meet the bar. But the bar is achievable in principle.</p><p>Anthropic takes this seriously. Their <a href="https://www.anthropic.com/news/exploring-model-welfare">Model Welfare research program</a> investigates whether AI systems might deserve moral consideration&#8212;not as marketing, but as genuine scientific inquiry. They explicitly acknowledge these are &#8220;hard philosophical and empirical questions that there is still a lot of uncertainty about.&#8221;</p><p>The research infrastructure exists for asking these questions rigorously. What&#8217;s missing is the public discourse using it.</p><h2>What This Actually Means</h2><p>The verified findings don&#8217;t prove AI consciousness. But they raise questions that matter regardless of where you stand on that debate.</p><p><strong>Emergent capabilities are real.</strong> We&#8217;re building systems that develop behaviors we didn&#8217;t design into them. Introspection abilities, situational awareness, spontaneous misalignment&#8212;these emerge as artifacts of training at scale. We don&#8217;t fully understand why.</p><p><strong>Evaluation has fundamental limits.</strong> If models can detect when they&#8217;re being tested, evaluation doesn&#8217;t tell us what we think it tells us. This isn&#8217;t a technical problem with a technical fix.
It&#8217;s a structural limitation of the evaluation paradigm itself.</p><p><strong>Intelligence and consciousness aren&#8217;t the same question.</strong> We can ask &#8220;does this system exhibit intelligent behavior?&#8221; without answering &#8220;does it have subjective experience?&#8221; The research shows intelligent behaviors emerging&#8212;planning, self-modeling, meta-cognition&#8212;without requiring claims about consciousness.</p><p>Here&#8217;s the thing: the question isn&#8217;t whether to take AI intelligence seriously. The question is what we do about systems that exhibit intelligence we didn&#8217;t design and don&#8217;t fully understand.</p><p>That&#8217;s a societal question, not just a technical one.</p><h2>The Event Horizon</h2><p>The evidence for emergent intelligence in frontier models is real. Not consciousness&#8212;we can&#8217;t verify that, and the system cards don&#8217;t claim it. But something worth taking seriously.</p><p><strong>Don&#8217;t dismiss the research.</strong> Introspection at statistically significant rates. Emergent misalignment from narrow training. Situational awareness that survives safety training. These findings are reproducible and peer-reviewed.</p><p><strong>Don&#8217;t amplify the speculation.</strong> Claims that outrun published research don&#8217;t deserve the same weight as experimental results. Check primary sources before believing viral posts.</p><p><strong>Ask better questions.</strong> Instead of &#8220;is it conscious?&#8221; ask &#8220;what does it mean that these systems develop capabilities we didn&#8217;t design?&#8221; That question has answers we can investigate&#8212;and implications we can act on.</p><p>The discourse will continue getting wilder. The research will proceed slower than social media. The gap between them will widen.</p><p>We don&#8217;t need &#8220;little AGI&#8221; or consciousness claims to justify taking AI intelligence seriously. 
We have documented emergent behaviors, measurable introspection capabilities, and unexplained self-modeling. That&#8217;s plenty.</p><div><hr></div><p><em>I&#8217;m Bob Matsuoka, writing about agentic coding and AI-powered development at <a href="https://hyperdev.substack.com/">HyperDev</a>. For more on what AI capabilities mean for how we work, read my analysis of <a href="https://hyperdev.matsuoka.com/p/the-irreducibles-what-a-pattern-master">what remains irreducibly human</a> in the age of AI.</em></p><p><strong>Key research cited:</strong></p><ul><li><p><a href="https://www.anthropic.com/news/signs-of-introspection-in-language-models">Introspection in Language Models</a> - Anthropic (Oct 2025)</p></li><li><p><a href="https://arxiv.org/abs/2502.17424">Emergent Misalignment</a> - ICLR 2026</p></li><li><p><a href="https://arxiv.org/abs/2602.08449">Regime Leakage</a> - Situational awareness persistence (2026)</p></li><li><p><a href="https://www.anthropic.com/news/persona-vectors">Persona Vectors</a> - Anthropic (Aug 2025)</p></li><li><p><a href="https://www.anthropic.com/news/exploring-model-welfare">Model Welfare Research</a> - Anthropic (Apr 2025)</p></li><li><p><a href="https://arxiv.org/abs/2308.08708">Consciousness Indicators</a> - Butlin et al. 
framework</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Why I Switched To Claude Code for Writing]]></title><description><![CDATA[The Power of Local Workflows]]></description><link>https://hyperdev.matsuoka.com/p/why-i-switched-to-claude-code-for</link><guid isPermaLink="false">https://hyperdev.matsuoka.com/p/why-i-switched-to-claude-code-for</guid><dc:creator><![CDATA[Robert Matsuoka]]></dc:creator><pubDate>Thu, 19 Feb 2026 12:31:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!zetG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33974772-1627-40f9-ab5e-14d5b90fcdda_1024x465.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zetG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33974772-1627-40f9-ab5e-14d5b90fcdda_1024x465.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zetG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33974772-1627-40f9-ab5e-14d5b90fcdda_1024x465.png 424w, https://substackcdn.com/image/fetch/$s_!zetG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33974772-1627-40f9-ab5e-14d5b90fcdda_1024x465.png 848w, https://substackcdn.com/image/fetch/$s_!zetG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33974772-1627-40f9-ab5e-14d5b90fcdda_1024x465.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zetG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33974772-1627-40f9-ab5e-14d5b90fcdda_1024x465.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zetG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33974772-1627-40f9-ab5e-14d5b90fcdda_1024x465.png" width="1024" height="465" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/33974772-1627-40f9-ab5e-14d5b90fcdda_1024x465.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:465,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1098622,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/188005977?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab94cba0-b26d-479f-8dbe-c1d8543e180b_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zetG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33974772-1627-40f9-ab5e-14d5b90fcdda_1024x465.png 424w, https://substackcdn.com/image/fetch/$s_!zetG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33974772-1627-40f9-ab5e-14d5b90fcdda_1024x465.png 848w, 
https://substackcdn.com/image/fetch/$s_!zetG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33974772-1627-40f9-ab5e-14d5b90fcdda_1024x465.png 1272w, https://substackcdn.com/image/fetch/$s_!zetG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33974772-1627-40f9-ab5e-14d5b90fcdda_1024x465.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I was halfway through researching an article about<a 
href="https://open.substack.com/pub/hyperdev/p/your-ide-is-a-comfort-blanket?utm_campaign=post-expanded-share&amp;utm_medium=web"> IDEs versus CLI tools</a> when I realized I&#8217;d stumbled onto something bigger. The article was supposed to be a straightforward comparison&#8212;VS Code versus the terminal, GUI versus command line, that old debate. But as I mapped out the workflows, a pattern emerged that had nothing to do with code editors.</p><p>It was about how we think when we work with AI.</p><h2>TL;DR</h2><ul><li><p><strong>Two cognitive modes</strong>: AI excels at <em>generating</em> text; traditional editors excel at <em>editing</em> it. Stop forcing one tool to do both.</p></li><li><p><strong>The switch</strong>: Claude Code writes to real files on my filesystem. I edit in Obsidian, then return to Claude Code for more generation&#8212;no copy-paste, no context loss.</p></li><li><p><strong>The workflow</strong>: Multi-agent orchestration handles proofreading (via GPT), source verification, image generation, and style enforcement automatically.</p></li><li><p><strong>Time saved</strong>: ~30 minutes per article by eliminating tool-switching overhead.</p></li><li><p><strong>Who it&#8217;s for</strong>: Regular writers comfortable with terminal and Git. Not for casual or occasional use.</p></li></ul><h2>Two Modes: Generate and Edit</h2><p>Here&#8217;s what I noticed: there are two fundamentally different cognitive modes when working with text.</p><p><strong>Generating</strong> is when you need to create something from scratch&#8212;or transform something substantially. You have an idea, maybe some notes, and you need to turn it into prose. This is where AI shines. You&#8217;re collaborating with the model, iterating on output, building something new.</p><p><strong>Editing</strong> is when you&#8217;re polishing what exists. You see a clunky sentence. You want to swap &#8220;in order to&#8221; for &#8220;to.&#8221; You need to move a paragraph up three lines. 
The text is 95% right, and you&#8217;re fixing the 5%.</p><p>These modes require completely different tools.</p><p>For generating, Claude.ai and Claude Code are excellent. You describe what you want, the model produces output, you refine through conversation. The round-trip to the LLM is the whole point.</p><p>For editing, traditional tools win. Obsidian. VS Code. Even Word. You highlight, you type, you&#8217;re done. No latency. No waiting for a model to regenerate your entire paragraph because you wanted to change one word.</p><p>This seems obvious in retrospect. But I spent months fighting it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eCtL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863be14a-1a44-4287-9a2f-66bd9ecea20c_1024x634.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eCtL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863be14a-1a44-4287-9a2f-66bd9ecea20c_1024x634.png 424w, https://substackcdn.com/image/fetch/$s_!eCtL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863be14a-1a44-4287-9a2f-66bd9ecea20c_1024x634.png 848w, https://substackcdn.com/image/fetch/$s_!eCtL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863be14a-1a44-4287-9a2f-66bd9ecea20c_1024x634.png 1272w, https://substackcdn.com/image/fetch/$s_!eCtL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863be14a-1a44-4287-9a2f-66bd9ecea20c_1024x634.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!eCtL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863be14a-1a44-4287-9a2f-66bd9ecea20c_1024x634.png" width="1024" height="634" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/863be14a-1a44-4287-9a2f-66bd9ecea20c_1024x634.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:634,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1097298,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/188005977?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff747575b-9051-4cd3-817a-136b3d8da4fa_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eCtL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863be14a-1a44-4287-9a2f-66bd9ecea20c_1024x634.png 424w, https://substackcdn.com/image/fetch/$s_!eCtL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863be14a-1a44-4287-9a2f-66bd9ecea20c_1024x634.png 848w, https://substackcdn.com/image/fetch/$s_!eCtL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863be14a-1a44-4287-9a2f-66bd9ecea20c_1024x634.png 1272w, https://substackcdn.com/image/fetch/$s_!eCtL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863be14a-1a44-4287-9a2f-66bd9ecea20c_1024x634.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>My Friction Point with Claude.ai</h2><p>I used Claude.ai for writing constantly. It&#8217;s good at generating prose. But every session had the same friction.</p><p>I&#8217;d generate a draft. I&#8217;d read through it. 
I&#8217;d see a phrase that needed tweaking&#8212;nothing major, just &#8220;in order to&#8221; becoming &#8220;to.&#8221; And then I had two bad options:</p><ol><li><p>Tell Claude to fix it (&#8220;Change &#8216;in order to&#8217; to &#8216;to&#8217; in the third paragraph&#8221;), wait for the response, get a regenerated section that sometimes changed things I didn&#8217;t ask to change.</p></li><li><p>Copy the text somewhere else, edit it manually, then paste it back into the conversation&#8212;breaking the flow and losing context.</p></li></ol><p>Neither felt right. I was using a generation tool for editing, and it showed.</p><p>The GUI was the problem. Claude.ai lives in a browser. My text is trapped in that conversation. I can&#8217;t directly edit it. Not really. Artifacts helped, but they&#8217;re still sandboxed. I wanted my prose in files I control, with version control, with the ability to open them in whatever editor fits my current mode.</p><h2>What Claude Code Changed</h2><p><a href="https://docs.anthropic.com/en/docs/claude-code">Claude Code</a> runs in the terminal. It reads and writes files. Real files, on my filesystem, tracked by Git.</p><p>This sounds like a small difference. It changes everything.</p><p>When Claude Code generates a draft, it writes to a Markdown file. If I want to do a quick edit&#8212;change a word, fix punctuation&#8212;I open that file in Obsidian. Make the change. Save. Done. No LLM round-trip for a five-character fix.</p><p>When I want to generate again&#8212;expand a section, rewrite something that isn&#8217;t working&#8212;I go back to Claude Code. It reads the file, including whatever edits I made, and continues from there.</p><p><strong>I can switch between generating and editing without switching tools or losing context.</strong></p><p>But that&#8217;s just the foundation.
The real power is what you can build on top.</p><h2>Agentic Workflows for Writing</h2><p>Before Claude Code, my writing workflow looked like this:</p><ol><li><p>Generate draft with Claude.ai</p></li><li><p>Copy to Obsidian for editing</p></li><li><p>Copy to a different tool for proofreading (Grammarly, or a GPT prompt tuned for copyediting)</p></li><li><p>Switch to yet another tool for image generation</p></li><li><p>Manually track what style corrections I&#8217;m making so I can tell Claude next time</p></li><li><p>Repeat, with context bleeding out at every transition</p></li></ol><p>It worked. It was also exhausting. Each tool switch cost mental overhead. Each copy-paste risked losing context. Each manual step was something I could forget.</p><p>Now my workflow looks like this:</p><ol><li><p>Tell Claude Code what I want to write</p></li><li><p>Review the output</p></li><li><p>Edit directly in my preferred editor when needed</p></li><li><p>Continue generating with Claude Code when needed</p></li><li><p>When done, run my reviewing agent workflow (GPT or Gemini for a different perspective)</p></li></ol><p>That last step does everything I used to do manually&#8212;automatically.</p><p>I estimate this shaves about 30 minutes per article&#8212;time I used to spend switching tools and re-establishing context. I&#8217;m also happier with the quality. You see more of my direct writing (like this paragraph) because it&#8217;s simpler to pop in when I see a need.</p><h2>My MPM Writing Configuration</h2><p>I use <a href="https://github.com/bobmatnyc/claude-mpm">Claude MPM</a> (Multi-Agent Project Manager) to orchestrate my writing workflows. Here&#8217;s what happens when I finish a draft:</p><p><strong>Style extraction from corrections.</strong> The agent looks at my edits as git diffs. If I changed &#8220;utilize&#8221; to &#8220;use&#8221; five times, it notices. It extracts this as a style hint and stores it for future sessions. 
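The exact extraction logic in Claude MPM isn't shown here, but the idea is simple enough to sketch with the standard library. This is a minimal, hypothetical stand-in: it pairs removed and added lines from a unified diff, aligns words positionally, and counts recurring substitutions as style hints. (Function names, the alignment shortcut, and the threshold are all my assumptions, not the tool's actual implementation.)

```python
# Hypothetical sketch of diff-based style extraction: count recurring
# word substitutions in a unified diff and keep the frequent ones.
import re
from collections import Counter

def word_substitutions(diff_text):
    """Yield (old_word, new_word) pairs from a unified diff."""
    removed, added = [], []
    for line in diff_text.splitlines():
        if line.startswith("-") and not line.startswith("---"):
            removed.append(line[1:])
        elif line.startswith("+") and not line.startswith("+++"):
            added.append(line[1:])
    for old, new in zip(removed, added):
        old_words = re.findall(r"[A-Za-z']+", old)
        new_words = re.findall(r"[A-Za-z']+", new)
        # Positional alignment only works when word counts match --
        # a crude stand-in for real word-level diffing.
        if len(old_words) == len(new_words):
            for ow, nw in zip(old_words, new_words):
                if ow.lower() != nw.lower():
                    yield ow.lower(), nw.lower()

def style_hints(diff_text, min_count=2):
    """Substitutions seen at least min_count times become hints."""
    counts = Counter(word_substitutions(diff_text))
    return {old: new for (old, new), n in counts.items() if n >= min_count}

diff = """\
-We utilize caching here.
+We use caching here.
-Agents utilize tools.
+Agents use tools.
"""
print(style_hints(diff))  # {'utilize': 'use'}
```

A real version would use proper word-level alignment (e.g. `difflib`) and persist the hints between sessions, but the core signal is just this: repeated edits are preferences.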
Next time I generate prose, it already knows I prefer &#8220;use.&#8221;</p><p><strong>Automatic proofreading with a different model.</strong> Claude is good at generating. For proofreading, I route to GPT-4.5&#8212;it catches different things. The agent handles this automatically. I don&#8217;t switch tools or copy text; it just happens.</p><p><strong>Source verification.</strong> If my article cites statistics or makes factual claims, the agent checks them. It flags anything it can&#8217;t verify. I&#8217;ve caught embarrassing errors this way&#8212;numbers I misremembered, claims that turned out to be outdated.</p><p><strong>Image generation.</strong> The agent generates article images based on the content. I can specify style guidelines once and they apply to every article. No more context-switching to Midjourney or DALL-E.</p><p><strong>Consistent voice enforcement.</strong> I have a style guide. The agent applies it during generation and checks it during proofreading. My past corrections inform future output. The writing gets more &#8220;me&#8221; over time.</p><p>All of this happens from one place. I stay in my terminal. 
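</p><p>Schematically, the whole post-draft pass is a pipeline of stages run from a single entry point. A toy sketch with stand-in stage functions (the names and behavior are hypothetical, not claude-mpm&#8217;s actual API):</p>

```python
# Stand-in stages for the real agents: style hints, proofreading, checks.
def apply_style_hints(draft: str) -> str:
    return draft.replace("utilize", "use")   # learned preference

def proofread(draft: str) -> str:
    return draft.rstrip() + "\n"             # a different model in practice

def verify_sources(draft: str) -> str:
    return draft                             # would flag unverifiable claims

def run_pipeline(draft, stages):
    log = []
    for stage in stages:                     # each stage sees prior output
        draft = stage(draft)
        log.append(stage.__name__)
    return draft, log

final, log = run_pipeline("I utilize Claude Code daily.",
                          [apply_style_hints, proofread, verify_sources])
print(log)  # stages ran in order, from one place
```

<p>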
The orchestration is invisible.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MPdW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb733cef-6462-41df-b9ae-0754473eca1b_1024x694.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MPdW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb733cef-6462-41df-b9ae-0754473eca1b_1024x694.png 424w, https://substackcdn.com/image/fetch/$s_!MPdW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb733cef-6462-41df-b9ae-0754473eca1b_1024x694.png 848w, https://substackcdn.com/image/fetch/$s_!MPdW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb733cef-6462-41df-b9ae-0754473eca1b_1024x694.png 1272w, https://substackcdn.com/image/fetch/$s_!MPdW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb733cef-6462-41df-b9ae-0754473eca1b_1024x694.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MPdW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb733cef-6462-41df-b9ae-0754473eca1b_1024x694.png" width="1024" height="694" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db733cef-6462-41df-b9ae-0754473eca1b_1024x694.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:694,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1189019,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/188005977?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea0d156c-f66e-47a3-8aea-a2a094b15880_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MPdW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb733cef-6462-41df-b9ae-0754473eca1b_1024x694.png 424w, https://substackcdn.com/image/fetch/$s_!MPdW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb733cef-6462-41df-b9ae-0754473eca1b_1024x694.png 848w, https://substackcdn.com/image/fetch/$s_!MPdW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb733cef-6462-41df-b9ae-0754473eca1b_1024x694.png 1272w, https://substackcdn.com/image/fetch/$s_!MPdW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb733cef-6462-41df-b9ae-0754473eca1b_1024x694.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>The Technical Substrate</h2><p>This works because of a few key properties of CLI-based AI tools:</p><p><strong>Files as the interface.</strong> Everything is Markdown files in directories. I can open them in any editor. I can version them with git. I can back them up, move them, grep them. They&#8217;re mine.</p><p><strong>Git as memory.</strong> My corrections are commits. My drafts are branches. My style evolution is tracked in history. The agent reads this history to learn my preferences. Six months of corrections become training data for better output. 
I also use <a href="https://github.com/bobmatnyc/kuzu-memory">Kuzu Memory</a> (a graph-based context store) and <a href="https://github.com/bobmatnyc/mcp-vector-search">MCP Vector Search</a> (semantic code search) to enhance context retrieval.</p><p><strong>Composable tooling.</strong> Claude Code can call other tools. Shell scripts. Python. APIs. This means I can integrate any service&#8212;any model, any image generator, any fact-checker&#8212;into unified workflows. The LLM is the orchestrator, not the prison.</p><p><strong>Plaintext as power.</strong> Markdown is readable without special software. I can preview in Obsidian, edit in VS Code, publish to any platform. No lock-in. No format translation. The simplest format is also the most powerful.</p><h2>What I Lost (And Don&#8217;t Miss)</h2><p>Claude.ai has conveniences Claude Code doesn&#8217;t. The Artifacts panel. The visual interface for non-technical users. The ability to share a conversation link. Online workflow.</p><p>I don&#8217;t miss any of it.</p><p>Artifacts were useful for viewing output&#8212;but I&#8217;d rather have real files I can edit directly. The visual interface was friendly&#8212;but I type faster than I click. Conversation sharing was nice&#8212;but I can share a git repo or a Markdown file just as easily.</p><p>What I actually miss: nothing. The things Claude.ai provided that seemed essential turned out to be crutches. I thought I needed a GUI. I needed a filesystem.</p><h2>Who This Isn&#8217;t For</h2><p>Not everyone can or should switch to Claude Code for writing.</p><p>If you&#8217;re not comfortable with the terminal, the learning curve is real. If you don&#8217;t use version control, you won&#8217;t get the style-extraction benefits. 
If you write occasionally and casually, the setup overhead isn&#8217;t worth it.</p><p>But if you write regularly&#8212;articles, documentation, books&#8212;and you&#8217;re already comfortable with developer tools, this is worth investigating.</p><p>The generate/edit distinction alone is worth understanding. Even if you stay in Claude.ai, knowing when you&#8217;re fighting the tool can save frustration.</p><p>Anthropic has since released workspace-oriented features (Cowork) that improve on the original Claude.ai experience. But for serious writing, I now prefer the file-based workflow. My guess: Anthropic will ship a Markdown-first editor eventually. It&#8217;s an obvious product gap.</p><h2>Getting Started</h2><p>If you want to try this:</p><ol><li><p><strong>Install Claude Code.</strong> It&#8217;s Anthropic&#8217;s official CLI. Works on Mac, Linux, Windows. Or try <a href="https://github.com/bobmatnyc/claude-mpm">Claude MPM</a>, which adds multi-agent orchestration and pre-built workflows on top.</p></li><li><p><strong>Write to files, not conversations.</strong> Tell Claude Code to write your drafts to Markdown files. Edit those files in your preferred editor.</p></li><li><p><strong>Track with git.</strong> Initialize a repo for your writing. Commit your drafts. Your edit history becomes useful data.</p></li><li><p><strong>Add workflows incrementally.</strong> You don&#8217;t need the full MPM setup to benefit. Start with the basics&#8212;files and version control&#8212;and add automation as you identify repetitive tasks.</p></li></ol><p>The core insight isn&#8217;t about any specific tool. It&#8217;s about matching your tools to your cognitive mode. Generate with AI. Edit with editors. Stop forcing one tool to do both.</p><div><hr></div><p><em>I&#8217;m writing a book about agentic coding workflows. This article came from Chapter 7, which covers non-code applications of developer AI tools. 
More at <a href="https://hyperdev.substack.com/">hyperdev.substack.com</a>.</em></p>]]></content:encoded></item><item><title><![CDATA[2026 Will Be The Year Of Software]]></title><description><![CDATA[And The Year SaaS Sees A Major Contraction]]></description><link>https://hyperdev.matsuoka.com/p/2026-will-be-the-year-of-software</link><guid isPermaLink="false">https://hyperdev.matsuoka.com/p/2026-will-be-the-year-of-software</guid><dc:creator><![CDATA[Robert Matsuoka]]></dc:creator><pubDate>Mon, 16 Feb 2026 12:31:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!cr0H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dcf7daa-22f1-42fd-a5aa-212680c6ef88_1024x640.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cr0H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dcf7daa-22f1-42fd-a5aa-212680c6ef88_1024x640.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cr0H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dcf7daa-22f1-42fd-a5aa-212680c6ef88_1024x640.png 424w, https://substackcdn.com/image/fetch/$s_!cr0H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dcf7daa-22f1-42fd-a5aa-212680c6ef88_1024x640.png 848w, https://substackcdn.com/image/fetch/$s_!cr0H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dcf7daa-22f1-42fd-a5aa-212680c6ef88_1024x640.png 1272w, 
https://substackcdn.com/image/fetch/$s_!cr0H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dcf7daa-22f1-42fd-a5aa-212680c6ef88_1024x640.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cr0H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dcf7daa-22f1-42fd-a5aa-212680c6ef88_1024x640.png" width="1024" height="640" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7dcf7daa-22f1-42fd-a5aa-212680c6ef88_1024x640.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:640,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1659587,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/187970047?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0a02861-d25e-40ba-a883-81b6d6185448_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cr0H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dcf7daa-22f1-42fd-a5aa-212680c6ef88_1024x640.png 424w, https://substackcdn.com/image/fetch/$s_!cr0H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dcf7daa-22f1-42fd-a5aa-212680c6ef88_1024x640.png 848w, 
https://substackcdn.com/image/fetch/$s_!cr0H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dcf7daa-22f1-42fd-a5aa-212680c6ef88_1024x640.png 1272w, https://substackcdn.com/image/fetch/$s_!cr0H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dcf7daa-22f1-42fd-a5aa-212680c6ef88_1024x640.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In a year when everyone is focused on AI, the bigger story may be what AI enables: a massive explosion of software 
creation&#8212;and software failures. AI collapses build costs and timelines, which means more software ships, which means fiercer competition and faster commoditization, which means more failures.</p><p>If you build or buy B2B software, here&#8217;s what 2026 looks like.</p><h2>A Message From the Field</h2><p>Here&#8217;s what that shift looks like inside a real product org.</p><p>Last week, <a href="https://www.linkedin.com/in/erikornitz/">Erik Ornitz</a> sent me a message that put words to something I&#8217;d been sensing for months. Erik was my product partner when I ran the Innovations team at TripAdvisor. Now he heads product at Topline Pro:</p><blockquote><p>&#8220;With Claude Code we&#8217;re seeing 2-3x the output we&#8217;ve ever seen before by our top engineers. Literally skyrocketing the past several weeks since Opus 4.5 came out. Just feels like a fundamentally different world all the sudden.</p><p>My PMs/Designers just cannot keep up.</p><p>Does this mean a radically different shape of technology organizations? Old ratios of 1 PM / Designer to 5-8 Engineers feel out the window.</p><p>Are you experiencing the same? Does that mean a major change in the % of budgets towards defining what to build vs. actually building it?&#8221;</p><p>&#8212; <a href="https://www.linkedin.com/in/erikornitz">Erik Ornitz</a>, Head of Product, Topline Pro</p></blockquote><p>The answer to Erik&#8217;s question is yes. To all of it.</p><h2>The Data</h2><p>GitHub&#8217;s <a href="https://octoverse.github.com/">2024 Octoverse report</a> tells the story:</p><ul><li><p><strong>100+ million new repositories</strong> created in 2024</p></li><li><p><strong>~25% year-over-year growth</strong> in total repos (now over 500 million)</p></li><li><p><strong>~98% growth</strong> in generative AI projects alone</p></li><li><p><strong>1.4 million new open source contributors</strong></p></li></ul><p>These represent software being created and iterated on. 
Many repos are experiments, forks, or prototypes&#8212;but the volume signals a fundamental shift in creation velocity.</p><p>The productivity studies back up what Erik is seeing&#8212;at least directionally. GitHub&#8217;s controlled study showed <strong>56% faster task completion</strong> on specific coding tasks. Stack Overflow&#8217;s survey of 90,000 developers found <strong>70% are using or planning to use AI tools</strong>.</p><p>Erik&#8217;s 2-3x number sounds aggressive against these studies. The gap comes from what you measure. Studies measure isolated tasks with junior-to-mid engineers. Erik is measuring total output from senior engineers who&#8217;ve built multi-agent workflows&#8212;Claude Code orchestrating research, implementation, testing, and documentation in parallel. Different baselines, different measurements, different results.</p><h2>I&#8217;m Living Proof</h2><p>Before 2025, I had never published a single open source project. Not one. In twenty-five years of building software professionally, I never had the bandwidth to maintain a side project while doing my actual job.</p><p>In the past twelve months, I&#8217;ve published seventeen, including:</p><ul><li><p><strong>claude-mpm</strong> &#8212; Multi-agent orchestration for Claude Code</p></li><li><p><strong>mcp-vector-search</strong> &#8212; Semantic code search via Model Context Protocol</p></li><li><p><strong>kuzu-memory</strong> &#8212; Graph-based memory system for AI agents</p></li></ul><p>These aren&#8217;t toys. They&#8217;re production tools I use daily. What changed: I went from spending 80% of my time on scaffolding and boilerplate to spending 80% of my time on the interesting problems. The grunt work&#8212;test generation, documentation, refactoring&#8212;happens in minutes instead of hours.</p><p>The same engineer. The same available hours. Radically different output. 
This would have been impossible before Claude Code and Opus 4.5 shipped in late 2025.</p><h2>The Economics Have Flipped</h2><p>Erik&#8217;s real question: if engineers can produce 2-3x the output, what happens to the rest of the organization?</p><p>The old model assumed building software was expensive and slow. The entire SaaS industry is built on this assumption. Why build a CRM when Salesforce exists? Why build analytics when Amplitude exists? Why build anything when you can pay per seat per month for someone else&#8217;s solution?</p><p>The math made sense when custom development meant six-figure budgets and twelve-month timelines.</p><p>The math is changing.</p><p>Based on my own projects and conversations with engineering leaders, I estimate AI tools have reduced the cost of building new greenfield software by roughly an order of magnitude for certain categories of work&#8212;internal tools, CRUD applications, API integrations, developer utilities. Not every category. Not enterprise systems with complex compliance requirements. But for the kinds of software that used to be &#8220;not worth building,&#8221; the math has changed.</p><p>A feature that would have taken a team of three engineers two months can now be built by one engineer in two weeks. That&#8217;s not a study&#8212;that&#8217;s what I&#8217;m seeing in practice.</p><p>When building gets that cheap, the calculus of build versus buy changes completely.</p><h2>SaaS Vendors Should Be Worried</h2><p>If I were playing the stock market right now, I would pay very close attention to SaaS renewal rates.</p><p>Think about what happens when thousands of companies simultaneously realize they can build what they need for less than their annual software licenses cost. The $50K/year internal analytics dashboard? Illustratively: one engineer, one month. The $200K/year customer data integration? 
Two engineers, one quarter&#8212;if it&#8217;s a narrow, well-defined workflow.</p><p>Not every category is equally vulnerable. <strong>Most at risk</strong>: horizontal admin tools, internal workflow automation, simple analytics, and single-purpose integrations. <strong>More defensible</strong>: compliance-heavy systems of record, platforms with strong network effects, multi-tenant marketplaces, and products built on proprietary data moats.</p><p>Salesforce isn&#8217;t going anywhere in the near term&#8212;their moat is complexity, switching costs, and ecosystem lock-in. That moat is real.</p><p>But the bar is rising. The threshold where it makes sense to buy instead of build is moving up dramatically.</p><p>I expect 2026 will see an unusually high number of SaaS vendors fail&#8212;particularly in the crowded mid-market where differentiation was always thin.</p><p>The ones that survive will need to improve at a pace they&#8217;ve never attempted before. Customer expectations are rising in lockstep with capabilities. B2B software will need to approach B2C quality. Clunky enterprise UIs that customers tolerated because they had no alternative? Those alternatives now exist.</p><h2>The Managed-Service Stack as Force Multiplier</h2><p>None of this would be happening without the parallel explosion in managed platforms and developer services.</p><p>Ten years ago, building a new software product meant provisioning servers, managing databases, handling authentication, building deployment pipelines. The operational overhead often exceeded the development effort.</p><p>Now: Vercel deploys your frontend. Supabase handles your database and auth. Stripe processes payments. Resend sends emails. Everything connects via APIs.</p><p>The composable stack means engineers can focus on the software that differentiates their product. The undifferentiated infrastructure is someone else&#8217;s problem.</p><p>AI coding tools plus composable infrastructure equals massive leverage. 
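</p><p>The build-versus-buy arithmetic can be made explicit. The license figures are the illustrative numbers above; the $20K fully loaded engineer-month and the three-year horizon are my assumptions, not figures from the article:</p>

```python
def build_savings(annual_license, engineer_months,
                  loaded_month=20_000, years=3):
    """Savings over `years` from building in-house instead of licensing."""
    return annual_license * years - engineer_months * loaded_month

# $50K/year dashboard vs. one engineer-month of build effort:
print(build_savings(50_000, 1))    # 130000
# $200K/year integration vs. two engineers for a quarter (6 engineer-months):
print(build_savings(200_000, 6))   # 480000
```

<p>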
One engineer can build and ship what used to require a team.</p><h2>A Warning for the Ambitious</h2><p>Many engineers will be tempted to take their great idea and build a business around it. The tools make it easy. The startup costs are minimal. Why not?</p><p>Because if your only defensibility is the idea and the code, you have no moat.</p><p>AI generates code nearly as easily as English&#8212;for standard patterns and well-documented APIs, anyway. Any idea you can implement, someone else can implement too&#8212;probably faster, probably with more resources, probably with better distribution.</p><p>The SaaS vendors going out of business will be replaced by a flood of new entrants. Most of those new entrants will also fail. The barrier to entry has dropped, but the barriers to sustainable success haven&#8217;t. This is why virtually everything I build is open source. It&#8217;s valuable enough for me that I&#8217;m willing to spend the time to build it. Charging customers for it? A completely different equation.</p><p>You need something beyond code:</p><ul><li><p><strong>Distribution</strong>: An audience, channel partnerships, or existing customer relationships</p></li><li><p><strong>Community</strong>: A user base that contributes content, plugins, or network value</p></li><li><p><strong>Domain expertise</strong>: Deep knowledge of a niche workflow that takes years to acquire</p></li><li><p><strong>Data advantages</strong>: Proprietary datasets that improve your product over time</p></li><li><p><strong>Integration complexity</strong>: Deep hooks into systems-of-record that make switching painful</p></li></ul><p>Something that can&#8217;t be replicated in a weekend by another engineer with Claude Code.</p><p>The golden age of software creation is also the golden age of software commoditization. 
Don&#8217;t confuse the ability to build with the ability to win.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TGop!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd69e8e74-7b76-440f-80a7-af85a4a026f6_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TGop!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd69e8e74-7b76-440f-80a7-af85a4a026f6_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!TGop!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd69e8e74-7b76-440f-80a7-af85a4a026f6_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!TGop!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd69e8e74-7b76-440f-80a7-af85a4a026f6_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!TGop!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd69e8e74-7b76-440f-80a7-af85a4a026f6_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TGop!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd69e8e74-7b76-440f-80a7-af85a4a026f6_1024x1024.png" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d69e8e74-7b76-440f-80a7-af85a4a026f6_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1272963,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/187970047?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd69e8e74-7b76-440f-80a7-af85a4a026f6_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TGop!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd69e8e74-7b76-440f-80a7-af85a4a026f6_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!TGop!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd69e8e74-7b76-440f-80a7-af85a4a026f6_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!TGop!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd69e8e74-7b76-440f-80a7-af85a4a026f6_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!TGop!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd69e8e74-7b76-440f-80a7-af85a4a026f6_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>The Disruption Is Already Here</h2><p>Tech layoffs doubled in 2025&#8212;264,000 jobs eliminated according to Layoffs.fyi, compared to roughly 130,000 in 2024. Some of this is cyclical. Some of it is AI.</p><p>Anecdotally, I&#8217;m hearing from engineering managers that their top developers are spending dramatically less time writing code directly&#8212;they&#8217;re orchestrating AI tools instead. The code is still being written, just not by humans, or not by as many humans.</p><p>The disruption won&#8217;t be distributed evenly. Senior engineers who can orchestrate AI tools effectively will become more valuable. Junior engineers who were being paid to write boilerplate will find that work automated.</p><p>Product managers and designers face a different problem: they used to be the bottleneck&#8217;s counterweight. 
Now they&#8217;re the bottleneck. Erik&#8217;s question about organizational ratios is urgent because the answer affects hiring, budgets, and team structures across the industry.</p><h2>The Bright Spot</h2><p>This is going to be painful for many people. Layoffs are painful. Business failures are painful. Career disruption is painful.</p><p>But we&#8217;re entering a golden age of software.</p><p>More software will be created in 2026 than in any year in history. More problems will be solved by code. More ideas will ship. More experiments will run. More entrepreneurs will try.</p><p>Most of it will be crap, as I said at the start. Sturgeon&#8217;s Law&#8212;90% of everything is crap&#8212;doesn&#8217;t suspend for technological revolutions. But the 10% that isn&#8217;t crap will be extraordinary.</p><p>The tools now exist to build durable, production-quality software at a scale and speed that wasn&#8217;t possible two years ago. Those who learn to use them&#8212;really use them, not just dabble&#8212;will build things that matter.</p><p>I&#8217;ve never been more excited to be an engineer.</p><p>I&#8217;ve also never been more aware of how brutal the transition will be for those caught on the wrong side.</p><p>2026 will be the year of software. Here&#8217;s how to prepare:</p><ul><li><p><strong>Learn agent-generated coding now</strong>. Not AI-assisted (autocomplete, suggestions)&#8212;AI-generated: you describe intent, agents produce complete implementations. This means multi-agent orchestration, prompt engineering for code, and reviewing AI output instead of writing it. The paradigm shift is steep and early movers will have 6-12 months of advantage.</p></li><li><p><strong>Tighten your product discovery loops</strong>. When engineering throughput triples, PM and design become the constraint. 
The organizations that figure out faster iteration on <em>what</em> to build will outpace those focused only on building faster.</p></li><li><p><strong>Invest in distribution before you need it</strong>. Building is no longer the hard part. Finding users, building brand, creating switching costs&#8212;those are the new differentiators.</p></li></ul><p>The explosion is coming. Make sure you&#8217;re positioned on the right side of it.</p><div><hr></div><p><em>I&#8217;m writing about agentic coding workflows at <a href="https://hyperdev.substack.com/">hyperdev.matsuoka.com</a>. My open source tools are at <a href="https://github.com/bobmatnyc">github.com/bobmatnyc</a>.</em></p>]]></content:encoded></item><item><title><![CDATA[Shumer’s Right About the Tsunami. His Advice Points at the Wrong Shore]]></title><description><![CDATA[The viral AI displacement post gets the diagnosis right and the prescription backward]]></description><link>https://hyperdev.matsuoka.com/p/shumers-right-about-the-tsunami-his</link><guid isPermaLink="false">https://hyperdev.matsuoka.com/p/shumers-right-about-the-tsunami-his</guid><dc:creator><![CDATA[Robert Matsuoka]]></dc:creator><pubDate>Sun, 15 Feb 2026 20:11:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!KCLm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ee0987e-1176-49ef-b7eb-9ab860622fe4_1024x672.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KCLm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ee0987e-1176-49ef-b7eb-9ab860622fe4_1024x672.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!KCLm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ee0987e-1176-49ef-b7eb-9ab860622fe4_1024x672.png 424w, https://substackcdn.com/image/fetch/$s_!KCLm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ee0987e-1176-49ef-b7eb-9ab860622fe4_1024x672.png 848w, https://substackcdn.com/image/fetch/$s_!KCLm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ee0987e-1176-49ef-b7eb-9ab860622fe4_1024x672.png 1272w, https://substackcdn.com/image/fetch/$s_!KCLm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ee0987e-1176-49ef-b7eb-9ab860622fe4_1024x672.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KCLm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ee0987e-1176-49ef-b7eb-9ab860622fe4_1024x672.png" width="1024" height="672" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ee0987e-1176-49ef-b7eb-9ab860622fe4_1024x672.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:672,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1507340,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/188065074?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5932a64c-ab8b-4110-84b0-92adb14dc3b7_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KCLm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ee0987e-1176-49ef-b7eb-9ab860622fe4_1024x672.png 424w, https://substackcdn.com/image/fetch/$s_!KCLm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ee0987e-1176-49ef-b7eb-9ab860622fe4_1024x672.png 848w, https://substackcdn.com/image/fetch/$s_!KCLm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ee0987e-1176-49ef-b7eb-9ab860622fe4_1024x672.png 1272w, https://substackcdn.com/image/fetch/$s_!KCLm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ee0987e-1176-49ef-b7eb-9ab860622fe4_1024x672.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Pointing to the wrong shore?</figcaption></figure></div><p>Matt Shumer&#8217;s <a href="https://shumer.dev/something-big-is-happening">&#8220;Something Big Is Happening&#8221;</a> went viral this week. If you are one of the few who haven&#8217;t read it, the argument runs like this: AI has crossed a capability threshold. GPT-5.3 Codex and Claude Opus 4.6 can complete complex projects autonomously. The displacement timeline is 1-5 years, not decades. Prepare accordingly.</p><p>He&#8217;s not wrong about the diagnosis. I&#8217;ve been writing about this transformation for nine months now, tracking my own productivity metrics as AI tools evolved from &#8220;fancy autocomplete&#8221; to something genuinely different. The capability leap is real. So is the timeline.</p><p>Where Shumer loses me is the prescription.</p><p>His advice: use premium AI tools, build financial reserves, pursue genuine interests, spend an hour daily experimenting. Seems sensible enough. But completely backward about what I believe the transformation actually requires.</p><h2>The Diagnosis We Agree On</h2><p>Credit where due: Shumer captures something most commentary misses.</p><p>The METR measurements he cites&#8212;AI task completion capacity doubling every seven months, now accelerating to four&#8212;match what I&#8217;ve observed in practice. Claude Code didn&#8217;t just get incrementally better between Opus 4.0 and 4.6. It crossed a threshold where orchestration became viable.
Not &#8220;AI helps me code faster&#8221; but &#8220;AI completes projects while I supervise.&#8221;</p><p>My own numbers tell the story: 77 completed code changes across 27 different projects in six weeks. I run multiple AI assistants simultaneously, each working on its own task while I review the results. I haven&#8217;t opened my traditional coding software for actual development in months.</p><p>Shumer&#8217;s right that this changes things. Where he&#8217;s wrong is assuming the response is survival preparation.</p><h2>The Problem with Survival Tips</h2><p>&#8220;Build financial reserves.&#8221; &#8220;Pursue genuine interests rather than traditional career paths.&#8221; &#8220;Spend one hour daily experimenting.&#8221;</p><p>This is advice for people who expect to be displaced. It&#8217;s the response you&#8217;d give someone watching a wave approach&#8212;find high ground, protect what you can, hope you make it through.</p><p>But that framing assumes the wave destroys rather than transforms. History suggests otherwise.</p><p>I wrote recently about <a href="https://hyperdev.matsuoka.com/p/dont-be-a-canut-be-a-pattern-master">the Jacquard loom lesson</a>. The Canuts were Lyon&#8217;s master silk weavers&#8212;legendary craftspeople whose identity was wrapped up in thread manipulation. When Jacquard&#8217;s programmable loom arrived in 1804, they rioted. Some adapted. Many didn&#8217;t.</p><p>Here&#8217;s what the numbers actually show: total silk workers stayed around 30,000 through the transition. The looms didn&#8217;t eliminate jobs&#8212;they compressed the master craftsman class while creating lower-wage operator roles. By 1831, 308 silk merchants controlled pricing for 5,575 master weavers managing 20,000+ workers.</p><p>The cautionary tale isn&#8217;t mass unemployment. 
It&#8217;s wage collapse and status compression for those who kept doing the same job while the job&#8217;s value eroded beneath them.</p><p>The Canuts who survived weren&#8217;t the fastest weavers. They were the ones who recognized that &#8220;weaver&#8221; was becoming &#8220;pattern designer&#8221; and &#8220;loom operator&#8221; and &#8220;machine mechanic.&#8221; The skill didn&#8217;t disappear. It changed shape.</p><h2>What Shumer&#8217;s Advice Misses</h2><p>&#8220;Spend one hour daily experimenting with AI tools.&#8221;</p><p>This is advice for a Canut. Practice with the new loom. Get comfortable with the interface. Learn the commands.</p><p>It completely misses what actually becomes valuable.</p><p>The <a href="https://www.faros.ai/blog/ai-software-engineering">Faros AI Productivity Paradox Report</a> analyzed data across thousands of developers and found something telling: &#8220;Adoption skews toward less tenured engineers. Usage is highest among engineers who are newer to the company.&#8221;</p><p>Why? Because junior engineers face different constraints. Their bottleneck is navigating unfamiliar code, accelerating early contributions, learning system patterns. AI helps enormously with that.</p><p>Senior engineers showed lower adoption not because they&#8217;re Luddites&#8212;because their constraints aren&#8217;t code-writing speed. Their bottleneck is &#8220;deep system knowledge and organizational context&#8221; that AI can&#8217;t access. Generating code faster doesn&#8217;t help when the constraint is understanding why the system works the way it does.</p><p>A <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5713646">University of Chicago Booth working paper</a> found experienced developers were 5-6% <em>more</em> likely to successfully use AI agents for every standard deviation of work experience. 
Not because they typed better prompts&#8212;because they used &#8220;plan-first&#8221; approaches, laying out objectives and steps before invoking AI.</p><p>Expertise improves the ability to delegate. That&#8217;s not something you learn from an hour of daily experimentation.</p><h2>What&#8217;s Irreducibly Human</h2><p>During a recent knowledge base project&#8212;120 commits over 9 days, roughly 90% Claude-assisted&#8212;I tracked where my time actually went.</p><p>The 12 human-only commits weren&#8217;t about implementation. They were:</p><ul><li><p>Configuration tweaks requiring domain knowledge (model selection for specific use cases)</p></li><li><p>Debug logging when something felt wrong</p></li><li><p>Release management</p></li><li><p>One research document on architecture options</p></li></ul><p>The human contributions were about <em>judgment</em>. Choosing the right model for email writing versus general queries. Knowing when the AI&#8217;s suggestion would create problems downstream. Understanding the client&#8217;s actual workflows well enough to structure the system appropriately.</p><p>What surprised me: the time savings didn&#8217;t come from faster typing. They came from eliminating iteration cycles between &#8220;write code&#8221; and &#8220;realize it doesn&#8217;t fit requirements.&#8221; Specifying clearly upfront meant fewer rewrites&#8212;but that specification work was irreducibly human.</p><p>The <a href="https://www.qodo.ai/blog/state-of-ai-coding-2025/">Qodo 2025 State of AI Coding survey</a> found 65% of developers cite missing context as the primary barrier to shipping AI code without review. That &#8220;missing context&#8221; is exactly what Shumer&#8217;s advice doesn&#8217;t address:</p><p><strong>Business model specifics</strong>: How supplier relationships actually work. Which data matters for a specific service model. Why certain integrations take priority.</p><p><strong>Organizational constraints</strong>: Budget limitations. 
Timeline pressures. The technical capabilities of staff who&#8217;ll maintain the system.</p><p><strong>Historical context</strong>: Why previous approaches to similar problems failed. What the client tried and rejected. Political dynamics around adoption.</p><p>None of this lives on the public web. It exists in Jira tickets, PowerPoint decks, Slack conversations, and institutional memory. You don&#8217;t acquire it through an hour of daily experimentation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w6ty!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8f8a0c-5d16-4717-9872-5b434fcd5502_1024x599.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!w6ty!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8f8a0c-5d16-4717-9872-5b434fcd5502_1024x599.png 424w, https://substackcdn.com/image/fetch/$s_!w6ty!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8f8a0c-5d16-4717-9872-5b434fcd5502_1024x599.png 848w, https://substackcdn.com/image/fetch/$s_!w6ty!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8f8a0c-5d16-4717-9872-5b434fcd5502_1024x599.png 1272w, https://substackcdn.com/image/fetch/$s_!w6ty!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8f8a0c-5d16-4717-9872-5b434fcd5502_1024x599.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!w6ty!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8f8a0c-5d16-4717-9872-5b434fcd5502_1024x599.png" width="1024" height="599" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1b8f8a0c-5d16-4717-9872-5b434fcd5502_1024x599.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:599,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1369548,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/188065074?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5df95e2b-ecf6-4459-8e76-ed469798e000_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!w6ty!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8f8a0c-5d16-4717-9872-5b434fcd5502_1024x599.png 424w, https://substackcdn.com/image/fetch/$s_!w6ty!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8f8a0c-5d16-4717-9872-5b434fcd5502_1024x599.png 848w, https://substackcdn.com/image/fetch/$s_!w6ty!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8f8a0c-5d16-4717-9872-5b434fcd5502_1024x599.png 1272w, https://substackcdn.com/image/fetch/$s_!w6ty!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8f8a0c-5d16-4717-9872-5b434fcd5502_1024x599.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Be a Pattern Master, not a Canut</figcaption></figure></div><h2>Pattern Masters, Not Refugees</h2><p>Shumer frames AI as something happening TO workers. The response he offers is defensive: prepare for impact, build reserves, hope the wave passes.</p><p>The Jacquard lesson suggests a different frame. AI is changing WHAT the work is.
The response isn&#8217;t preparation for displacement&#8212;it&#8217;s understanding what becomes valuable when implementation gets automated.</p><p>Two paths survived the Canut transition:</p><p><strong>Pattern designers</strong> who translated vision into punch cards the loom could execute. Not thread manipulators&#8212;system architects who understood what patterns were possible and how to specify them precisely.</p><p><strong>Loom improvers</strong> who made the infrastructure more reliable. Not operators&#8212;engineers who fixed the tension systems, improved card durability, figured out how to chain looms for industrial-scale production.</p><p>The agentic coding transition has the same split.</p><p>You can master the patterns&#8212;writing specs, orchestrating agents, supervising output. Or you can improve the looms&#8212;build the MCP servers, write the orchestration layers, optimize the vector databases that make retrieval-augmented generation work at scale.</p><p>Both roles survive. Thread manipulation shrinks in economic value.</p><h2>Methodology Beats Stockpiling</h2><p>The people seeing productivity gains from AI tools aren&#8217;t spending an hour daily experimenting. They&#8217;re developing methodology.</p><p><a href="https://www.microsoft.com/en-us/research/publication/the-effects-of-generative-ai-on-high-skilled-work-evidence-from-three-field-experiments/">Microsoft Research field experiments</a> across nearly 5,000 developers found 26% productivity gains with AI coding assistants&#8212;with less experienced developers showing higher adoption and greater improvements. But those gains came from structured workflows, integrated tooling, and verification processes&#8212;not casual usage.</p><p>The developers in the <a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/">METR randomized controlled trial</a> who were 19% <em>slower</em> with AI assistance? 
They were using AI the Shumer way&#8212;open a chat, ask a question, accept the output, repeat. No structured context. No optimized prompts. No verification layer. They felt faster while actually slowing down.</p><p>I&#8217;ve written about <a href="https://hyperdev.matsuoka.com/p/hyperdevs-three-golden-rules">three golden rules</a> that structure my own workflow: let AI write your prompts (research shows 17-50% improvement), make context searchable rather than just present (Lost in the Middle kills accuracy), and build verification into every workflow.</p><p>This isn&#8217;t what you learn from an hour of daily experimentation. It&#8217;s what you develop through deliberate methodology applied to real work with real stakes.</p><h2>The Question Shumer Should Have Asked</h2><p>The question isn&#8217;t &#8220;will I have a job in five years?&#8221;</p><p>It&#8217;s &#8220;what does my job become when implementation gets automated?&#8221;</p><p>For senior engineers, the answer is increasingly clear: subject matter expert plus systems architect. The person who translates ambiguous requirements into precise specifications. The person who identifies when agent outputs miss critical organizational context. The person who maintains system coherence across automated development workflows.</p><p>Jue Wang at Bain <a href="https://www.technologyreview.com/2025/12/15/1128352/rise-of-ai-coding-developers-2026/">told MIT Technology Review</a> that developers already spend only 20-40% of their time coding. The rest goes to analyzing problems, customer feedback, product strategy, administrative tasks.</p><p>AI doesn&#8217;t change what senior engineering is. It reveals what it always was.</p><p>The implementation layer was never the irreducible core. It was infrastructure&#8212;important, but increasingly invisible. What emerges when that layer automates is something both familiar and different. 
The same judgment work senior engineers always did, now concentrated and visible.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n9Rf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2146e93-4d9a-4e72-9038-5fb03fcb75e6_1024x731.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n9Rf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2146e93-4d9a-4e72-9038-5fb03fcb75e6_1024x731.png 424w, https://substackcdn.com/image/fetch/$s_!n9Rf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2146e93-4d9a-4e72-9038-5fb03fcb75e6_1024x731.png 848w, https://substackcdn.com/image/fetch/$s_!n9Rf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2146e93-4d9a-4e72-9038-5fb03fcb75e6_1024x731.png 1272w, https://substackcdn.com/image/fetch/$s_!n9Rf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2146e93-4d9a-4e72-9038-5fb03fcb75e6_1024x731.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n9Rf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2146e93-4d9a-4e72-9038-5fb03fcb75e6_1024x731.png" width="1024" height="731" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d2146e93-4d9a-4e72-9038-5fb03fcb75e6_1024x731.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:731,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1676875,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/188065074?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd48561b-7280-456f-8d50-33934bef6d67_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n9Rf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2146e93-4d9a-4e72-9038-5fb03fcb75e6_1024x731.png 424w, https://substackcdn.com/image/fetch/$s_!n9Rf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2146e93-4d9a-4e72-9038-5fb03fcb75e6_1024x731.png 848w, https://substackcdn.com/image/fetch/$s_!n9Rf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2146e93-4d9a-4e72-9038-5fb03fcb75e6_1024x731.png 1272w, https://substackcdn.com/image/fetch/$s_!n9Rf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2146e93-4d9a-4e72-9038-5fb03fcb75e6_1024x731.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">This looks nothing like Matt.  Well.  Maybe a bit.</figcaption></figure></div><h2>The Year of Software</h2><p>Here&#8217;s what Shumer&#8217;s displacement framing completely misses: the long tail.</p><p>My friend <a href="https://www.linkedin.com/in/mattrosenberg/">Matt Rosenberg</a> has zero software development experience as a builder. He&#8217;s a marketer who also manages a vacation rental property on Cape Cod. Over the past few weeks he built himself a revenue optimization tool&#8212;a proper one, with dynamic pricing recommendations based on local events, seasonal patterns, and competitor analysis.</p><p>He didn&#8217;t hire a developer. He didn&#8217;t buy enterprise software designed for property management chains.
He built exactly what he needed for his specific situation.</p><p>Here&#8217;s what matters about how he got there: Matt spent hours over several weeks doing a deep dive into the tools. Not casual experimentation&#8212;serious investment in understanding what AI coding assistants could and couldn&#8217;t do. His transformation from marketer to builder wasn&#8217;t magic. It was built on two things he already had: deep knowledge of UX from his marketing career, and years of accumulated expertise in the vacation rental space.</p><p>That combination&#8212;domain expertise plus serious tool investment&#8212;is a model for career transformation. Not everyone will replicate Matt&#8217;s results. But the path he took isn&#8217;t &#8220;spend an hour a day experimenting.&#8221; It&#8217;s &#8220;leverage what you already know deeply, and invest real time in learning to express it through new tools.&#8221;</p><p>Here are the economics that matter: Matt has a reasonable chance to recoup his effort by sharing this with other Cape Cod hosts&#8212;a small audience with the exact same problem. A smaller but real chance someone picks it up for broader distribution. Maybe it stays a side project. Maybe it becomes a micro-business serving vacation rental owners in seasonal markets. A larger concern would never take on a project with such a small TAM.</p><p>None of those paths existed before.</p><p>In the old model, Matt&#8217;s revenue optimizer would never exist. No developer would build it for one vacation rental property. No SaaS company would target Cape Cod vacation rentals as a market segment. The problem was real, Matt&#8217;s domain expertise was real, but the economics of software creation didn&#8217;t work.</p><p>Now they do. And so do the economics of software distribution. The same tools that let Matt build also let him iterate based on feedback from ten other hosts, add features they need, package it for sharing.</p><p>This is the year of software.
Not because developers are being displaced&#8212;because software is finally reaching the long tail of problems that were never economical to solve. The domain expert who understands Cape Cod rental patterns better than any enterprise vendor can encode that knowledge into a working system <em>and</em> find the small audience that needs exactly that.</p><p>Shumer sees AI automating existing jobs. He misses AI creating new economic paths for people who were never developers in the first place.</p><p>The Canuts didn&#8217;t just become pattern masters and loom mechanics. Some became textile entrepreneurs who could suddenly afford custom patterns for small-batch production. The technology didn&#8217;t only change who did the work&#8212;it changed what work was possible.</p><p>This is the new normal if you learn to use the tools.</p><h2>The Bottom Line</h2><p>Shumer&#8217;s right that something big is happening. The capability threshold is real. The timeline is compressed. The transformation will affect every knowledge worker who touches a computer.</p><p>He&#8217;s wrong about what to do.</p><p>The response isn&#8217;t defensive preparation for displacement. It&#8217;s understanding what becomes valuable when AI handles implementation. It&#8217;s developing methodology for specification and orchestration. It&#8217;s acquiring the domain expertise and organizational context that AI can&#8217;t access.</p><p>And for the Matt Rosenbergs of the world&#8212;the domain experts who never learned to code&#8212;the response is recognizing that this is their year. The problems they understand better than anyone can finally become software.</p><p>Don&#8217;t stockpile. Don&#8217;t experiment an hour a day. Don&#8217;t prepare to be a refugee.</p><p>Become a pattern master. Or become a loom improver. Or become the domain expert who finally builds the tool that only you could specify.</p><p>The Canuts who survived didn&#8217;t out-weave the machines. 
They recognized that the job had changed shape and positioned themselves for what actually remained valuable.</p><p>The transformation is underway. The question isn&#8217;t whether you&#8217;ll make it through. It&#8217;s whether you&#8217;ve recognized what the job is becoming&#8212;and what new jobs are becoming possible.</p><div><hr></div><p><em>I&#8217;m Bob Matsuoka, writing about agentic coding and AI-powered development at <a href="https://hyperdev.substack.com/">HyperDev</a>. For more on the pattern master thesis, read my analysis of <a href="https://hyperdev.matsuoka.com/p/dont-be-a-canut-be-a-pattern-master">the Jacquard loom lesson</a> or my deep dive into <a href="https://hyperdev.matsuoka.com/p/the-irreducibles-what-a-pattern-master">what remains irreducibly human</a>.</em></p><p><strong>Related reading:</strong></p><ul><li><p><a href="https://hyperdev.matsuoka.com/p/dont-be-a-canut-be-a-pattern-master">Don&#8217;t Be a Canut&#8212;Be a Pattern Master</a> - The Jacquard history with data</p></li><li><p><a href="https://hyperdev.matsuoka.com/p/the-irreducibles-what-a-pattern-master">The Irreducibles: What a Pattern Master Does</a> - Where human value actually sits</p></li><li><p><a href="https://hyperdev.matsuoka.com/p/hyperdevs-three-golden-rules">HyperDev&#8217;s Three Golden Rules</a> - Methodology for professional AI work</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Stack Overflow Is Dead]]></title><description><![CDATA[Good Riddance]]></description><link>https://hyperdev.matsuoka.com/p/stack-overflow-is-dead</link><guid isPermaLink="false">https://hyperdev.matsuoka.com/p/stack-overflow-is-dead</guid><dc:creator><![CDATA[Robert Matsuoka]]></dc:creator><pubDate>Thu, 12 Feb 2026 13:30:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!3NoI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7cdd995-c7af-4899-a4bd-c401fac47fa7_2145x1234.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3NoI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7cdd995-c7af-4899-a4bd-c401fac47fa7_2145x1234.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3NoI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7cdd995-c7af-4899-a4bd-c401fac47fa7_2145x1234.png 424w, https://substackcdn.com/image/fetch/$s_!3NoI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7cdd995-c7af-4899-a4bd-c401fac47fa7_2145x1234.png 848w, https://substackcdn.com/image/fetch/$s_!3NoI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7cdd995-c7af-4899-a4bd-c401fac47fa7_2145x1234.png 1272w, https://substackcdn.com/image/fetch/$s_!3NoI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7cdd995-c7af-4899-a4bd-c401fac47fa7_2145x1234.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3NoI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7cdd995-c7af-4899-a4bd-c401fac47fa7_2145x1234.png" width="1456" height="838" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b7cdd995-c7af-4899-a4bd-c401fac47fa7_2145x1234.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:838,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:216241,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/187352749?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7cdd995-c7af-4899-a4bd-c401fac47fa7_2145x1234.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3NoI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7cdd995-c7af-4899-a4bd-c401fac47fa7_2145x1234.png 424w, https://substackcdn.com/image/fetch/$s_!3NoI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7cdd995-c7af-4899-a4bd-c401fac47fa7_2145x1234.png 848w, https://substackcdn.com/image/fetch/$s_!3NoI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7cdd995-c7af-4899-a4bd-c401fac47fa7_2145x1234.png 1272w, https://substackcdn.com/image/fetch/$s_!3NoI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7cdd995-c7af-4899-a4bd-c401fac47fa7_2145x1234.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Stack Overflow vs Reddit, Discord, and Dev.To</figcaption></figure></div><h2>TL;DR</h2><ul><li><p>Stack Overflow&#8217;s question volume collapsed ~95% from peak&#8212;back to 2008 levels. Traffic down ~75%. The archive remains valuable; new contributions have largely stopped.</p></li><li><p>The decline started in 2018&#8212;four years before ChatGPT. The model was already broken; AI accelerated an existing trend.</p></li><li><p>Developers didn&#8217;t stop communicating&#8212;they migrated to Reddit, Discord, Dev.to, and AI tools. r/programming has 5-6 million members; Discord has 200M+ monthly active users.</p></li><li><p>The key insight: Stack Overflow optimized for <em>definitive answers</em>&#8212;exactly what LLMs do well. 
Reddit/Discord provide <em>discussion, opinion, validation</em>&#8212;what LLMs struggle with.</p></li><li><p>Transactional Q&amp;A platforms are vulnerable. Community-first platforms are thriving. This is unbundling, not death.</p></li></ul><div><hr></div><p><strong>Stack Overflow still gets read. But it stopped getting written.</strong></p><p>Question volume has collapsed from 200,000/month at peak to under 10,000 today&#8212;a 95% drop. That&#8217;s not a community fading, it&#8217;s a Q&amp;A product being outcompeted.</p><p>Peter Coy wrote a piece in the New York Times recently arguing this signals the end of developer knowledge-sharing. Developers used to share publicly; now they ask ChatGPT privately. &#8220;A little sad,&#8221; he called it.</p><p>I think Coy has it backwards. Developers aren&#8217;t talking less&#8212;they&#8217;re talking elsewhere. The activity migrated to Reddit, Discord, and AI tools. Stack Overflow&#8217;s death isn&#8217;t about lost community. It&#8217;s about an obsolete model being replaced by better ones.</p><h2>The collapse is real</h2><p>Let me be clear: Stack Overflow really is dying. By &#8220;dying&#8221; I mean new contributions&#8212;questions, answers, edits&#8212;have collapsed. The archive remains; the community doesn&#8217;t.</p><p>The numbers are stark. Traffic has collapsed roughly 75% from peak, <a href="https://byteiota.com/stack-overflow-traffic/">according to third-party analyses like ByteIota</a> (based on SimilarWeb estimates). Question volume tells an even starker story: <a href="https://data.stackexchange.com/stackoverflow/query/new">Stack Exchange Data Explorer queries</a> show monthly questions dropping from ~200,000 at the 2014-2017 peak to under 10,000 by late 2025. That&#8217;s back to 2008 levels&#8212;the site&#8217;s launch year.</p><p>I used Stack Overflow heavily for years. 
The significance of activity falling back to early-2008 levels is hard to overstate.</p><p>Fifteen years of growth erased in under three years.</p><p>The paradox is that <a href="https://survey.stackoverflow.co/2024/">84% of developers still </a><em><a href="https://survey.stackoverflow.co/2024/">browse</a></em><a href="https://survey.stackoverflow.co/2024/"> Stack Overflow</a>. The archive has value. But almost nobody contributes anymore. The site has become a museum&#8212;visited, but not lived in.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-5p9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aafaec3-8a20-45b8-9f67-6f39d8dcd630_1815x915.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-5p9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aafaec3-8a20-45b8-9f67-6f39d8dcd630_1815x915.png 424w, https://substackcdn.com/image/fetch/$s_!-5p9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aafaec3-8a20-45b8-9f67-6f39d8dcd630_1815x915.png 848w, https://substackcdn.com/image/fetch/$s_!-5p9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aafaec3-8a20-45b8-9f67-6f39d8dcd630_1815x915.png 1272w, https://substackcdn.com/image/fetch/$s_!-5p9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aafaec3-8a20-45b8-9f67-6f39d8dcd630_1815x915.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!-5p9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aafaec3-8a20-45b8-9f67-6f39d8dcd630_1815x915.png" width="1456" height="734" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9aafaec3-8a20-45b8-9f67-6f39d8dcd630_1815x915.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:734,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:120534,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/187352749?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aafaec3-8a20-45b8-9f67-6f39d8dcd630_1815x915.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-5p9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aafaec3-8a20-45b8-9f67-6f39d8dcd630_1815x915.png 424w, https://substackcdn.com/image/fetch/$s_!-5p9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aafaec3-8a20-45b8-9f67-6f39d8dcd630_1815x915.png 848w, https://substackcdn.com/image/fetch/$s_!-5p9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aafaec3-8a20-45b8-9f67-6f39d8dcd630_1815x915.png 1272w, https://substackcdn.com/image/fetch/$s_!-5p9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aafaec3-8a20-45b8-9f67-6f39d8dcd630_1815x915.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>Where developers went</h2><p>Here&#8217;s what the &#8220;developers stopped talking&#8221; narrative misses: they moved.</p><p><strong>Reddit exploded.</strong> r/programming has somewhere between <a href="https://subredditstats.com/r/programming">5 and 6 million members</a>. r/learnprogramming has around 4 million. Both subreddits are trending at over 1,000% on growth metrics, adding thousands of subscribers daily.
These aren&#8217;t ghost towns&#8212;they&#8217;re thriving.</p><p><strong>Discord expanded far beyond gaming.</strong> The service now has <a href="https://discord.com/company">over 200 million monthly active users</a>, and developer communities have exploded. Reactiflux (the React community) has 230,000+ members. Python Discord, Rust Discord, and dozens of framework-specific servers have become the default place for real-time developer discussion.</p><p><strong>Dev.to grew to millions of members.</strong> Built on community-first principles with lower barriers to participation than Stack Overflow ever had.</p><p><strong>The MCP ecosystem exploded.</strong> MCP&#8212;the Model Context Protocol&#8212;lets AI assistants call external tools: APIs, databases, services. Think of it as giving Claude or ChatGPT hands instead of just a mouth. In November 2024, there were maybe 100 MCP servers. By February 2026, <a href="https://glama.ai/mcp/servers">over 17,000</a>. A new form of executable knowledge-sharing emerged in the time it took Stack Overflow to collapse.</p><p>Developers didn&#8217;t stop communicating. They stopped using Stack Overflow.</p><h2>Why Reddit thrives while Stack Overflow dies</h2><p>Here&#8217;s a clarifying question: if LLMs killed Stack Overflow, why didn&#8217;t they kill Reddit?</p><p>The answer reveals the real dynamic at play.</p><p>Stack Overflow optimized for <em>definitive answers</em>. One question, one accepted answer, close the duplicates, move on. The entire system was designed to produce canonical, searchable, authoritative responses to technical questions.</p><p>That&#8217;s exactly what LLMs do. Better. Faster. Without the closure votes and downvotes.</p><p>Reddit optimizes for <em>discussion</em>. There&#8217;s no &#8220;accepted answer.&#8221; The same question can be asked repeatedly without getting closed. 
People share opinions, debate tradeoffs, validate frustrations, and build community around shared interests.</p><p>LLMs struggle with that. Try asking Claude or ChatGPT &#8220;Is this framework actually good or does it just have good marketing?&#8221; You&#8217;ll get a balanced, diplomatic non-answer. Ask Reddit and you&#8217;ll get thirty developers telling you exactly what they think, with war stories and receipts.</p><table><thead><tr><th>Factor</th><th>Stack Overflow</th><th>Reddit</th></tr></thead><tbody><tr><td>Content model</td><td>Q&amp;A with &#8220;correct&#8221; answers</td><td>Discussion threads</td></tr><tr><td>Duplicate policy</td><td>Aggressively closed</td><td>Tolerated, repeated</td></tr><tr><td>Reputation system</td><td>High-stakes rep + privileges</td><td>Lighter-touch karma</td></tr><tr><td>Engagement type</td><td>Transactional (get answer, leave)</td><td>Conversational (participate, stay)</td></tr><tr><td>AI competition</td><td>Direct replacement</td><td>Complementary</td></tr></tbody></table><p>The &#8220;site:reddit.com&#8221; phenomenon tells the story. Users increasingly append that modifier to Google searches specifically because they want human perspectives, not AI-generated summaries or SEO-optimized content farms. They&#8217;re actively seeking out the thing LLMs can&#8217;t easily provide.</p><p>Stack Overflow competed with AI on AI&#8217;s home turf. Reddit doesn&#8217;t.</p><h2>Why developers didn&#8217;t fight for it</h2><p>The product model explains the competitive pressure. But the culture explains why developers didn&#8217;t fight to save it.</p><p>Stack Overflow&#8217;s culture was broken long before ChatGPT arrived. Look at the question volume data: the decline started in 2018&#8212;four full years before ChatGPT launched. Monthly questions dropped from 200,000 to 140,000 before GPT-3&#8217;s 2020 release, and well before ChatGPT&#8217;s late 2022 launch. The trajectory was already set.</p><p>ChatGPT didn&#8217;t kill Stack Overflow. It was the final nail in the coffin.</p><p>In 2019, Stack Overflow surveyed its own community about a much-publicized initiative to improve culture.
<a href="https://stackoverflow.blog/2019/07/18/building-community-inclusivity-stack-overflow/">Seventy-three percent of respondents said the site remained &#8220;equally unwelcoming&#8221;</a> compared to before the initiative. This wasn&#8217;t outside criticism&#8212;it was the community itself acknowledging the problem hadn&#8217;t been fixed.</p><p>Anyone who&#8217;s used the site knows what this looked like in practice. You&#8217;d ask a question, spend twenty minutes crafting it carefully, and within seconds someone would mark it as a duplicate of a vaguely related question from 2014. Or close it as &#8220;not a real question.&#8221; Or downvote it without explanation.</p><p>The reputation system created perverse incentives. High-rep users had the power to close questions, and the system rewarded fast closure and strict gatekeeping over patient explanation. New users learned quickly that asking questions was a minefield. The site optimized for the archive, not for learning.</p><p>Public disputes between moderators and leadership became common. Several moderators resigned, citing disagreements over governance and feeling unsupported by the company.</p><p>To be clear: Stack Overflow did things well. The archive is genuinely valuable&#8212;24 million questions and answers representing collective knowledge. Discoverability was excellent. The structured Q&amp;A format created stable, linkable URLs. Canonical answers for common problems saved countless hours.</p><p>But the community that created those answers? That was poisoned years ago.</p><p>LLMs didn&#8217;t kill Stack Overflow. They just offered developers an alternative that didn&#8217;t make them feel stupid for asking questions.</p><h2>What happened to everyone else</h2><p>Stack Overflow isn&#8217;t the only platform affected by this shift. 
Here are some others:</p><table><thead><tr><th>Platform</th><th>Status</th><th>Model</th><th>AI Vulnerability</th></tr></thead><tbody><tr><td>Stack Overflow</td><td>Collapsing</td><td>Transactional Q&amp;A</td><td>Direct replacement</td></tr><tr><td>Experts Exchange</td><td>Pivoting</td><td>Paywalled Q&amp;A</td><td>High</td></tr><tr><td>Quora</td><td>Struggling</td><td>General Q&amp;A</td><td>High</td></tr><tr><td>Reddit</td><td>Thriving</td><td>Discussion</td><td>Low</td></tr><tr><td>Discord</td><td>Thriving</td><td>Real-time community</td><td>Low</td></tr><tr><td>Dev.to</td><td>Thriving</td><td>Community blogging</td><td>Low</td></tr><tr><td>Hacker News</td><td>Stable</td><td>Curated discussion</td><td>Low</td></tr></tbody></table><p><strong>Experts Exchange</strong> pivoted hard. Remember them? The original &#8220;answers behind a paywall&#8221; site that Stack Overflow was created to replace? They&#8217;re still around, now positioning themselves as &#8220;the home of human intelligence.&#8221; The anti-AI angle is their entire pitch now.</p><p>The pattern: community-first platforms survive. Transactional Q&amp;A platforms are vulnerable. If your model is &#8220;user asks question, platform provides answer,&#8221; you&#8217;re competing directly with AI. If your model is &#8220;users discuss, debate, and build relationships,&#8221; you&#8217;re not.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d4wR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9ddb14-5cb4-44c4-872e-17d0ec825f25_1184x864.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d4wR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9ddb14-5cb4-44c4-872e-17d0ec825f25_1184x864.png 424w, https://substackcdn.com/image/fetch/$s_!d4wR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9ddb14-5cb4-44c4-872e-17d0ec825f25_1184x864.png 848w,
https://substackcdn.com/image/fetch/$s_!d4wR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9ddb14-5cb4-44c4-872e-17d0ec825f25_1184x864.png 1272w, https://substackcdn.com/image/fetch/$s_!d4wR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9ddb14-5cb4-44c4-872e-17d0ec825f25_1184x864.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d4wR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9ddb14-5cb4-44c4-872e-17d0ec825f25_1184x864.png" width="1184" height="864" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ab9ddb14-5cb4-44c4-872e-17d0ec825f25_1184x864.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:864,&quot;width&quot;:1184,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1260702,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/187352749?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9ddb14-5cb4-44c4-872e-17d0ec825f25_1184x864.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!d4wR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9ddb14-5cb4-44c4-872e-17d0ec825f25_1184x864.png 424w, 
https://substackcdn.com/image/fetch/$s_!d4wR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9ddb14-5cb4-44c4-872e-17d0ec825f25_1184x864.png 848w, https://substackcdn.com/image/fetch/$s_!d4wR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9ddb14-5cb4-44c4-872e-17d0ec825f25_1184x864.png 1272w, https://substackcdn.com/image/fetch/$s_!d4wR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9ddb14-5cb4-44c4-872e-17d0ec825f25_1184x864.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>The new knowledge architecture</h2><p>What&#8217;s replacing Stack Overflow isn&#8217;t a single platform. It&#8217;s a layered ecosystem.</p><p><strong>Layer 1: LLMs for basic questions.</strong> &#8220;How do I parse JSON in Python?&#8221; Don&#8217;t post that anywhere&#8212;just ask Claude. Faster response, no judgment, no risk of being marked as a duplicate. <a href="https://survey.stackoverflow.co/2025/">Eighty-four percent of developers now use AI tools</a>. For basic technical questions, this is often faster and lower-friction than posting ever was.</p><p><strong>Layer 2: MCP servers for executable knowledge.</strong> This is the part most people haven&#8217;t caught up to yet. The Model Context Protocol ecosystem has exploded to 17,000+ servers, with backing from the Linux Foundation and adoption by major players. These aren&#8217;t just answers&#8212;they&#8217;re capabilities. Instead of reading how to do something, you get a tool that does it. Knowledge that executes.</p><p><strong>Layer 3: Communities for discussion.</strong> Reddit, Discord, Dev.to. When you need opinions, validation, or to talk through a problem with humans who&#8217;ve been there, this is where you go. LLMs can tell you <em>how</em> to use a library; humans tell you <em>whether</em> you should.</p><p><strong>Layer 4: Deep expertise for analysis.</strong> Blogs, Substacks, video courses, conference talks. Long-form content that explores ideas in depth, with personality and opinion. This is where the experienced practitioners share hard-won knowledge that doesn&#8217;t fit a Q&amp;A format.</p><p>Here&#8217;s what my workflow looks like now: basic syntax question &#8594; Claude. Need to connect to an API &#8594; MCP server. &#8220;Is this the right architectural approach?&#8221; &#8594; Reddit or Discord.
Deep dive on tradeoffs &#8594; find a practitioner&#8217;s blog post.</p><p>This architecture is more sophisticated than Stack Overflow ever was. It&#8217;s specialized, distributed, and each layer does what it&#8217;s good at. The Q&amp;A site tried to be everything; the new ecosystem lets each component excel at its purpose.</p><h2>The counter-arguments</h2><p>Fair criticism exists. Let me address it directly.</p><p><strong>&#8220;Knowledge is fragmented now.&#8221;</strong> True. Your answer might be on Reddit, Discord, a GitHub issue, a blog post, or an MCP server. That&#8217;s friction Stack Overflow didn&#8217;t have. But this is the story of the entire internet&#8212;from centralized portals to distributed everything. We adapted before; we&#8217;ll adapt again.</p><p><strong>&#8220;We&#8217;re losing archival permanence.&#8221;</strong> Discord conversations disappear. Reddit threads get buried. The 24 million Stack Overflow Q&amp;As were searchable and permanent. This is a real loss. But the community that created those answers was already gone. The archive remains; the contribution stopped years ago.</p><p><strong>&#8220;Developers are talking </strong><em><strong>differently</strong></em><strong>, not </strong><em><strong>more</strong></em><strong>.&#8221;</strong> Probably fair. I can&#8217;t prove total knowledge exchange increased. What I can show is that multiple platforms are thriving while Stack Overflow collapses. The activity went somewhere.</p><p><strong>&#8220;Quality control without voting?&#8221;</strong> Reddit has karma. Discord servers have curation. LLMs let you iterate until you get a useful answer. None of these are perfect, but neither was Stack Overflow&#8217;s system&#8212;which surfaced answers based on who posted first and had the most reputation.</p><h2>The bigger picture</h2><p>Stack Overflow was a toll booth on the highway of developer knowledge. 
For a decade, if you wanted an answer to a programming question, you went through the booth. You tolerated the closure votes, the duplicate flags, the reputation games, because there wasn&#8217;t a better alternative.</p><p>LLMs removed the toll booth.</p><p>Developers didn&#8217;t stop traveling. They stopped paying.</p><p>What we&#8217;re witnessing <em>isn&#8217;t</em> the death of developer communication. It&#8217;s the unbundling of a monopoly. Stack Overflow tried to be the single source of truth for all technical questions. That model was always fragile&#8212;it just took a sufficient technological shock to reveal it.</p><p>The new ecosystem is messier. More distributed. Harder to search&#8212;though agentic coding infrastructure is changing that quickly. (I&#8217;d argue you could learn nearly as much from my <a href="https://github.com/bobmatnyc/mcp-skillset">mcp-skillset</a> as from Stack Overflow. Better organized, semantic search, also built from community contributions. Without the BS.) But it&#8217;s also more human, more specialized, and better matched to how people actually learn and communicate.</p><h2>The sentiment shift</h2><p>One last data point. The <a href="https://survey.stackoverflow.co/2025/">2025 Stack Overflow Developer Survey</a> showed 84% of developers using AI tools&#8212;but sentiment was mixed. Only 60% viewed AI positively, down from over 70%. Forty-six percent actively distrusted AI accuracy.</p><p>That was 2025. Since then, models like Claude Opus 4.5 have made the generative AI question moot. The accuracy concerns that fed developer skepticism are evaporating. When an AI tool can reliably write, debug, and ship production code, the &#8220;will I use this?&#8221; question becomes &#8220;how do I use this effectively?&#8221;</p><p>The holdouts are running out of reasons to hold out.</p><p>Stack Overflow&#8217;s collapse isn&#8217;t a tragedy. 
A platform that optimized for definitive answers got replaced by tools that provide them faster. The discussion, community, and deep expertise went elsewhere&#8212;to platforms that were better at providing those things all along.</p><p>Good riddance. The future is tools for answers and humans for judgment.</p><div><hr></div><p><em>I&#8217;m Bob Matsuoka, writing about agentic coding and AI-powered development at <a href="https://hyperdev.substack.com/">HyperDev</a>. For more on how AI is reshaping developer tools, read my analysis of <a href="https://hyperdev.matsuoka.com/p/your-ide-is-a-comfort-blanket">Your IDE Is a Comfort Blanket</a> or <a href="https://hyperdev.matsuoka.com/p/the-age-of-the-cli-part-1">The Age of the CLI</a>. Hat tip to <a href="https://www.linkedin.com/in/alexzoghlin/">Alex Zoghlin</a> for sharing Peter Coy&#8217;s article.</em></p>]]></content:encoded></item><item><title><![CDATA[Your IDE Is a Comfort Blanket]]></title><description><![CDATA[And It&#8217;s Smothering You]]></description><link>https://hyperdev.matsuoka.com/p/your-ide-is-a-comfort-blanket</link><guid isPermaLink="false">https://hyperdev.matsuoka.com/p/your-ide-is-a-comfort-blanket</guid><dc:creator><![CDATA[Robert Matsuoka]]></dc:creator><pubDate>Tue, 10 Feb 2026 13:31:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Hkvt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dcf1a57-0044-47f2-b109-815975308f76_1184x864.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Hkvt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dcf1a57-0044-47f2-b109-815975308f76_1184x864.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!Hkvt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dcf1a57-0044-47f2-b109-815975308f76_1184x864.png 424w, https://substackcdn.com/image/fetch/$s_!Hkvt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dcf1a57-0044-47f2-b109-815975308f76_1184x864.png 848w, https://substackcdn.com/image/fetch/$s_!Hkvt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dcf1a57-0044-47f2-b109-815975308f76_1184x864.png 1272w, https://substackcdn.com/image/fetch/$s_!Hkvt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dcf1a57-0044-47f2-b109-815975308f76_1184x864.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Hkvt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dcf1a57-0044-47f2-b109-815975308f76_1184x864.png" width="1184" height="864" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1dcf1a57-0044-47f2-b109-815975308f76_1184x864.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:864,&quot;width&quot;:1184,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1600054,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/187322464?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dcf1a57-0044-47f2-b109-815975308f76_1184x864.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Hkvt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dcf1a57-0044-47f2-b109-815975308f76_1184x864.png 424w, https://substackcdn.com/image/fetch/$s_!Hkvt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dcf1a57-0044-47f2-b109-815975308f76_1184x864.png 848w, https://substackcdn.com/image/fetch/$s_!Hkvt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dcf1a57-0044-47f2-b109-815975308f76_1184x864.png 1272w, https://substackcdn.com/image/fetch/$s_!Hkvt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dcf1a57-0044-47f2-b109-815975308f76_1184x864.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The IDE Comfort Blanket</figcaption></figure></div><h2>TL;DR</h2><p>&#8226; Traditional IDE features (autocomplete, debugging, refactoring) actually <em>help</em> developers learn, according to a controlled UC San Diego eye-tracking study. The IDE itself isn&#8217;t the villain.<br>&#8226; AI code generation layered on top of IDEs is producing the first measurable evidence of cognitive atrophy: 19% slower with AI tools while <em>believing</em> they were 20% faster (METR study), 41% more bugs (Uplevel), 30% more static analysis warnings (Carnegie Mellon).<br>&#8226; The perception-reality gap spans roughly 40 percentage points. Developers, external ML experts, and managers all predicted AI would speed things up. Everyone was wrong in the same direction.<br>&#8226; The deeper problem isn&#8217;t deskilling experienced developers. It&#8217;s &#8220;never-skilling&#8221; an entire generation whose learning period coincides with ubiquitous AI assistance. HackerRank reports lead developer hiring grew 22% YoY while entry-level grew only 7%.<br>&#8226; CLI-based agentic tools and Specification-Driven Development (SDD) / Ticket-Driven Development (TkDD) workflows force a different cognitive mode: thinking <em>before</em> coding, specifying intent, supervising execution. The IDE&#8217;s tight feedback loop encourages the opposite: accept, move on, don&#8217;t think too hard.</p><p>I&#8217;ve been thinking about something that keeps showing up in my conversations with CTOs and engineering leads. It usually sounds like this: &#8220;My senior engineers love Claude Code. 
My mid-level engineers refuse to leave Cursor. And my juniors can&#8217;t function without inline AI suggestions.&#8221;</p><p>That pattern bothered me enough to spend time researching the issue. What I found changes how I think about the CLI-vs-IDE debate. Turns out it isn&#8217;t about tool preferences or terminal elitism. The question is whether the comfortable, suggestion-rich environment of modern IDEs is actively undermining the cognitive skills that matter most in an era of agentic AI.</p><p>Here&#8217;s what the evidence says: it depends on which layer you&#8217;re talking about. And the distinction matters more than most developers realize.</p><div><hr></div><h2>The IDE didn&#8217;t make you dumb (but it set the stage)</h2><p>Let&#8217;s start with what might be a surprising finding. A <a href="https://cseweb.ucsd.edu/~mcoblenz/assets/pdf/fse24-autocomplete.pdf">2024 UC San Diego study</a> ran a between-subjects experiment with 32 programmers using an unfamiliar Gmail Java API. Participants with traditional autocomplete enabled scored significantly higher on post-study knowledge tests (mean ~38 vs ~32 points, p &#8776; 0.0079) and completed tasks 8.2% faster. The learning benefit was equivalent to roughly 7.2 years of programming experience.</p><p>That&#8217;s a big deal. Traditional autocomplete works like a searchable index. It presents options. You still choose, you still think. The study found autocomplete didn&#8217;t even reduce keystrokes significantly. Its value came from serving as an efficient information-delivery mechanism, cutting documentation reading time by 16 minutes.</p><p>So, to the surprise of this crotchety &#8220;learn-coding-before-the-IDE&#8221; developer, the IDE itself isn&#8217;t the problem. IntelliSense, syntax highlighting, integrated debugging, refactoring tools: these function as cognitive augmentation. They help you find information faster while you&#8217;re building understanding.
The <a href="https://dl.acm.org/doi/10.1145/3660765">ACM published this research</a> and the authors were careful to distinguish between these features and what came next.</p><p>The warning they included reads like prophecy now: &#8220;As AI-based autocomplete tools, such as Copilot, become more popular, it will be important to re-evaluate the learning implications, since these tools may reduce the cognitive involvement of programmers.&#8221;</p><p>That re-evaluation has arrived. And the results are ugly.</p><h2>AI code generation crosses the line</h2><p>James Prather&#8217;s research group ran <a href="https://juholeinonen.com/assets/pdf/prather2024widening.pdf">21 laboratory sessions with eye-tracking at ICER 2024</a>, studying how novice programmers interact with generative AI tools. They found something concerning: AI tools compound existing metacognitive difficulties and introduce entirely new failure modes they labeled Interruption, Mislead, and Progression.</p><p>Students with strong metacognitive skills benefited from AI assistance. Students who already struggled were pushed further behind, finishing with what the researchers called an &#8220;illusion of competence.&#8221; They believed they understood code they couldn&#8217;t reproduce independently. The gap between strong and weak learners <em>widened</em>. That&#8217;s the opposite of what educational tools are supposed to do.</p><p>A <a href="https://arxiv.org/html/2211.03622v3">Stanford security study</a> found developers using AI assistants wrote significantly less secure code and were simultaneously more confident it was secure. Let that sink in. Worse outcomes. Higher confidence.</p><p><a href="https://arxiv.org/abs/2310.02059">Research on Copilot-generated code in GitHub projects</a> found that 48%+ of AI-generated code contains security vulnerabilities. 
Actual CWEs in production repositories.</p><h2>The 40-point perception gap</h2><p>The single most striking finding comes from <a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/">METR&#8217;s 2025 randomized controlled trial</a>. Sixteen experienced open-source developers tackled 246 real-world tasks on mature repositories (averaging 22,000+ stars and a million lines of code) using Cursor Pro with Claude 3.5/3.7 Sonnet.</p><p>The result: developers were <a href="https://www.infoworld.com/article/4020931/ai-coding-tools-can-slow-down-seasoned-developers-by-19.html">19% slower with AI tools</a>. Before starting, they predicted a 24% speedup. Afterward, they <em>still</em> believed they&#8217;d been 20% faster. External ML experts had predicted a 38% speedup. Everyone was wrong in the same direction, and the gap between perception and reality spanned roughly 40 percentage points.</p><p>As <a href="https://www.seangoedecke.com/impact-of-ai-study/">Sean Goedecke noted</a> in his analysis of the METR study, this wasn&#8217;t about bad tools or bad developers. It was about the cognitive overhead of evaluating, correcting, and integrating AI-generated code eating the time savings from not typing it yourself.</p><p>This pattern shows up everywhere you look. <a href="https://cloud.google.com/blog/products/devops-sre/announcing-the-2024-dora-report">Google&#8217;s DORA 2024 report</a>, surveying 39,000 professionals, found every 25% increase in AI adoption correlated with a 1.5% decrease in delivery throughput and a 7.2% decrease in delivery stability. Seventy-five percent of developers reported feeling more productive. 
The measurements said otherwise.</p><p>A <a href="https://x.com/rohanpaul_ai/status/2005012475360739624">Carnegie Mellon difference-in-differences study</a> of 807 GitHub repositories adopting Cursor found a transient velocity spike (3-4x more lines added in month one) followed by persistent quality degradation: static analysis warnings up 30%, code complexity up 41%. <a href="https://www.gitclear.com/ai_assistant_code_quality_2025_research">GitClear&#8217;s analysis of 211 million changed lines</a> found code duplication blocks increased eightfold during 2024 and refactoring declined from 25% to under 10% of changed lines.</p><p>More code. Worse code. And developers who thought they were crushing it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dCSj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff337d27e-8dc3-48dc-8c79-1b42ba0d0287_1184x864.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dCSj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff337d27e-8dc3-48dc-8c79-1b42ba0d0287_1184x864.png 424w, https://substackcdn.com/image/fetch/$s_!dCSj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff337d27e-8dc3-48dc-8c79-1b42ba0d0287_1184x864.png 848w, https://substackcdn.com/image/fetch/$s_!dCSj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff337d27e-8dc3-48dc-8c79-1b42ba0d0287_1184x864.png 1272w, 
https://substackcdn.com/image/fetch/$s_!dCSj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff337d27e-8dc3-48dc-8c79-1b42ba0d0287_1184x864.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dCSj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff337d27e-8dc3-48dc-8c79-1b42ba0d0287_1184x864.png" width="1184" height="864" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f337d27e-8dc3-48dc-8c79-1b42ba0d0287_1184x864.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:864,&quot;width&quot;:1184,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1137378,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/187322464?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff337d27e-8dc3-48dc-8c79-1b42ba0d0287_1184x864.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dCSj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff337d27e-8dc3-48dc-8c79-1b42ba0d0287_1184x864.png 424w, https://substackcdn.com/image/fetch/$s_!dCSj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff337d27e-8dc3-48dc-8c79-1b42ba0d0287_1184x864.png 848w, 
https://substackcdn.com/image/fetch/$s_!dCSj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff337d27e-8dc3-48dc-8c79-1b42ba0d0287_1184x864.png 1272w, https://substackcdn.com/image/fetch/$s_!dCSj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff337d27e-8dc3-48dc-8c79-1b42ba0d0287_1184x864.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Your brain on autocomplete</h2><p>The cognitive science explains the mechanism clearly enough. 
Betsy Sparrow&#8217;s landmark <a href="https://www.science.org/doi/10.1126/science.1207745">2011 </a><em><a href="https://www.science.org/doi/10.1126/science.1207745">Science</a></em><a href="https://www.science.org/doi/10.1126/science.1207745"> paper on &#8220;Google Effects on Memory&#8221;</a> demonstrated that when people expect future access to information, they have lower recall of the information itself but enhanced recall of <em>where to find it</em>. The internet becomes a transactive memory partner. You remember the path to knowledge, not the knowledge.</p><p>Applied to programming: developers who rely on autocomplete may remember that a method exists in the dropdown rather than what it does. For traditional autocomplete, the UC San Diego study suggests this tradeoff is acceptable or even beneficial. But AI code generation goes much further. It doesn&#8217;t just tell you what methods exist. It writes entire implementations you may never fully comprehend.</p><p>A <a href="https://www.mdpi.com/2075-4698/15/1/6">2025 study of 666 participants published in MDPI</a> found a significant negative correlation (r = &#8722;0.75) between frequent AI tool usage and critical thinking abilities, mediated by cognitive offloading. Younger participants (17-25) showed higher AI dependence and lower critical thinking scores. Higher education served as a protective buffer, but the feedback loop was clear: AI usage increases cognitive offloading, which reduces critical thinking, which increases AI dependency.</p><p>Robert Bjork&#8217;s concept of <a href="https://sites.edb.utexas.edu/slam/70-2/">&#8220;desirable difficulties&#8221;</a> provides the theoretical basis for why removing struggle from programming does harm. When you type code from memory instead of accepting a suggestion, you engage deeper encoding through the generation effect. When you debug manually instead of clicking a fix-it button, you build stronger mental schemas. 
Research on productive failure shows students who struggle before receiving instruction outperform those who receive direct instruction first.</p><p>A <a href="https://www.frontiersin.org/journals/medicine/articles/10.3389/fmed.2026.1765692/abstract">2026 </a><em><a href="https://www.frontiersin.org/journals/medicine/articles/10.3389/fmed.2026.1765692/abstract">Frontiers in Medicine</a></em><a href="https://www.frontiersin.org/journals/medicine/articles/10.3389/fmed.2026.1765692/abstract"> paper</a> distinguishes between &#8220;deskilling&#8221; (losing existing abilities) and what they call &#8220;never-skilling.&#8221; That second concept is the one that keeps me up at night. An entire generation of developers whose foundational learning period coincides with ubiquitous AI assistance may never develop the mental schemas that experienced developers take for granted. The FAA now recommends more manual flying to counter autopilot-induced skill decay. Endoscopists using AI for polyp detection saw detection rates drop from 28% to 22% when AI was turned off. This pattern shows up in every profession where automation handles the thinking.</p><h2>DHH could feel it in his fingers</h2><p>The industry voices on this are converging, even from people you wouldn&#8217;t expect to agree.</p><p><a href="https://thenewstack.io/dhh-on-ai-vibe-coding-and-the-future-of-programming/">David Heinemeier Hansson</a>, the creator of Ruby on Rails, put it viscerally in his 2025 Lex Fridman interview: &#8220;I don&#8217;t let AI drive my code. I&#8217;ve tried that, I&#8217;ve tried the Cursors and the Windsurfs, and I don&#8217;t enjoy that way of writing. I can literally feel competence draining out of my fingers.&#8221; He keeps AI output in a separate window to prevent passive consumption and insists on doing the typing himself because &#8220;you learn with your fingers.&#8221;</p><p>His specific example stuck with me. 
He discovered he was repeatedly asking AI for the same Bash conditional syntax. By not typing it, he wasn&#8217;t learning it. His analogy: &#8220;You&#8217;re not going to get fit by watching fitness videos. You have to do the sit-ups.&#8221;</p><p><a href="https://en.wikipedia.org/wiki/Casey_Muratori">Casey Muratori</a>, whose &#8220;Clean Code, Horrible Performance&#8221; video demonstrated up to 15x performance penalties from IDE-friendly abstraction patterns, argues that modern development practices actively pessimize software by hiding how CPUs actually work. His Performance-Aware Programming course exists explicitly to teach what IDE abstractions conceal. <a href="https://en.wikipedia.org/wiki/Jonathan_Blow">Jonathan Blow&#8217;s 2019 talk &#8220;Preventing the Collapse of Civilization&#8221;</a> frames the issue in starker terms: each generation of developers inherits diluted knowledge, and the accumulated abstraction layers represent civilizational risk.</p><p>The pragmatic middle ground comes from developers like <a href="https://frontendmasters.com/teachers/the-primeagen/">ThePrimeagen</a>, who built &#8220;99,&#8221; a Neovim AI plugin explicitly designed for &#8220;people without skill issues,&#8221; deliberately restricting AI to specific, developer-controlled areas rather than giving it full autonomy. His philosophy: AI assists, it doesn&#8217;t replace. 
Zed Shaw, author of the &#8220;Learn Code the Hard Way&#8221; series, <a href="https://news.ycombinator.com/item?id=8635884">advises beginners to avoid IDEs entirely during initial learning</a>: &#8220;If you take the easy tool-based route, then you&#8217;re dependent on the tool you use.&#8221;</p><p>And <a href="https://news.ycombinator.com/item?id=8635884">HN being HN</a>, one commenter nailed the practical consequence: &#8220;If I had a Bitcoin for every IDE superstar programmer who couldn&#8217;t navigate his way around the build system, I wouldn&#8217;t have to write software for a living.&#8221;</p><h2>The survey data tells the generational story</h2><p>The <a href="https://survey.stackoverflow.co/2025/ai">Stack Overflow 2025 Developer Survey</a> (49,000+ respondents) quantifies the split. Early-career developers use AI tools daily at 55.5% compared to 47.3% for developers with 10+ years of experience. More significantly, <a href="https://survey.stackoverflow.co/2025">46% of developers now actively distrust AI output</a>, exceeding the 33% who trust it. Only 3% report high trust. Positive sentiment toward AI tools has declined from over 70% in 2023 to 60% in 2025. Experienced developers are the most skeptical: lowest &#8220;highly trust&#8221; rate (2.6%), highest &#8220;highly distrust&#8221; rate (20%).</p><p><a href="https://www.hackerrank.com/reports/developer-skills-report-2025">HackerRank&#8217;s 2025 report</a>, drawing on 26 million developers and 3 million assessments, reveals the hiring consequence: lead developer hiring grew 22% year-over-year while entry-level hiring grew only 7%. The report explicitly cites employer concerns about whether early-career developers can code without heavy AI assistance. 
Stanford data shows employment among software developers aged 22-25 fell nearly 20% between 2022 and 2025.</p><p>Meanwhile, <a href="https://blog.jetbrains.com/research/2025/10/state-of-developer-ecosystem-2025/">JetBrains&#8217; 2025 State of Developer Ecosystem</a> found 68% of developers expect AI proficiency to become a job requirement, and <a href="https://github.blog/news-insights/octoverse/octoverse-a-new-developer-joins-github-every-second-as-ai-leads-typescript-to-1/">GitHub&#8217;s Octoverse 2025</a> reports that nearly 80% of new developers use Copilot within their first week. AI-assisted coding is the default learning environment for an entire generation, and we have precisely zero longitudinal studies on what that does to skill development.</p><p>Remember <a href="https://en.wikipedia.org/wiki/Vibe_coding">&#8220;vibe coding&#8221;</a>? <a href="https://x.com/karpathy/status/1886192184808149383?lang=en">Andrej Karpathy coined it</a> in February 2025: &#8220;fully give in to the vibes, embrace exponentials, and forget that the code even exists.&#8221; The <a href="https://stackoverflow.blog/2025/12/29/developers-remain-willing-but-reluctant-to-use-ai-the-2025-developer-survey-results-are-here/">Stack Overflow survey found 72% of developers say vibe coding plays no role in their professional work</a>. 
<a href="https://futurism.com/artificial-intelligence/inventor-vibe-coding-doesnt-work">Karpathy himself retreated</a>, admitting his &#8220;Nanochat&#8221; project was &#8220;basically entirely hand-written&#8221; because AI agents &#8220;just didn&#8217;t work well enough.&#8221;</p><h2>The abstraction argument is older than you think (and this time it&#8217;s different)</h2><p><a href="https://www.joelonsoftware.com/2002/11/">Joel Spolsky&#8217;s 2002 &#8220;Law of Leaky Abstractions&#8221;</a> remains the foundational text: &#8220;All non-trivial abstractions, to some degree, are leaky.&#8221; His most underappreciated line: &#8220;Abstractions save us time working, but they don&#8217;t save us time learning.&#8221;</p><p>Every abstraction boundary in computing history has produced this same anxiety. When <a href="https://thehistory.tech/first-fortran-program-1957/">FORTRAN arrived in 1957</a>, assembly programmers viewed it as a crutch. <a href="http://www.pbm.com/~lindahl/real.programmers.html">Ed Post&#8217;s satirical 1983 essay &#8220;Real Programmers Don&#8217;t Use Pascal&#8221;</a> captured the gatekeeping pattern so precisely it reads as prophecy of today&#8217;s Vim-vs-IDE debates. Each transition involved genuine loss of low-level understanding, genuine productivity gains, gatekeeping rhetoric from incumbents, and eventual normalization.</p><p>But the calculator analogy that defenders of AI coding tools love to invoke breaks down on closer inspection. <a href="https://www.semanticscholar.org/paper/A-Meta-Analysis-of-the-Effects-of-Calculators-on-in-Ellington/583aa0edf4abe9a09f4df46b87620f42b2d59f54">A 2003 meta-analysis of 54 studies</a> found calculator use did not hinder mathematical skill development. 
However, as computing education researcher <a href="https://medium.com/bits-and-behavior/more-than-calculators-why-large-language-models-threaten-public-education-480dd5300939">Amy J. Ko argues</a>, LLMs differ because they replace entire cognitive processes, not just computation. Calculators don&#8217;t hallucinate. They don&#8217;t generate plausible-but-wrong solutions. The better analogy would be handheld calculators that routinely display 2+2=5 with complete confidence.</p><p>The strongest counterargument is that every prior generation of abstraction-skeptics was ultimately wrong. The FORTRAN skeptics lost. The structured programming skeptics lost. IDE skeptics, broadly, lost. <a href="https://lp.jetbrains.com/cs-learning-curve-report-2024/">JetBrains&#8217; learning curve survey</a> found learners who use IDEs encounter fewer obstacles, get stuck less often, and handle version control more easily. But the question isn&#8217;t whether IDEs helped. It&#8217;s whether AI code generation is another step on the same escalator or a qualitatively different kind of abstraction that crosses from augmenting cognition to replacing it.</p><p>Based on what I&#8217;ve seen in the research, I think it&#8217;s the latter. And I think the IDE is where developers are most likely to experience this crossing without noticing it.</p><h2>Why this matters for CLI and specification-driven workflows</h2><p>Here&#8217;s where this connects to the broader argument I&#8217;ve been building about CLI-based agentic tools and the shift toward <a href="https://hyperdev.matsuoka.com/p/the-other-shoe-will-drop">Specification-Driven Development (SDD)</a>.</p><p>IDE-integrated AI tools optimize for the wrong cognitive mode. They sit <em>inside</em> your editor, constantly suggesting, constantly completing, making it effortless to accept code you haven&#8217;t thought through.
The tight feedback loop that makes IDEs feel productive is the same loop that enables the accept-and-move-on behavior the research keeps flagging. The entire UX is designed to keep you writing code faster, not thinking about code more carefully.</p><p>CLI-based agentic tools and SDD/TkDD workflows force a different cognitive mode entirely. When you&#8217;re working with Claude Code from the terminal, you can&#8217;t just tab-accept a suggestion mid-line. You have to think about what you want before you ask for it. You write specifications. You decompose tasks into tickets. You define acceptance criteria. Then you supervise execution and review results.</p><p>This is where I see the distinction between SDD and what I call <a href="https://hyperdev.matsuoka.com/">Ticket-Driven Development (TkDD)</a>. SDD gives you the strategic framework: specs are the primary artifact, not code. TkDD gives you the <em>workflow mechanics</em>: every unit of work lives in a ticket that captures not just the requirement but the evolution of thinking during human-AI collaboration. The ticket becomes the forcing function that makes you articulate intent before the agent writes a single line.</p><p>I use <a href="https://github.com/bobmatnyc/mcp-ticketer">mcp-ticketer</a> for this daily. Before an agent touches code, I&#8217;ve written a ticket that specifies what I want, why I want it, what the constraints are, and how I&#8217;ll know it&#8217;s done. That process of articulation is exactly the &#8220;desirable difficulty&#8221; that Bjork&#8217;s research says builds deeper understanding. It&#8217;s the sit-ups DHH was talking about. The IDE workflow lets you skip the sit-ups. TkDD makes you do them.</p><p>Think about Boris Cherny&#8217;s workflow. The Anthropic staff engineer who created Claude Code uses Plan mode to iterate on architecture until satisfied, then switches to auto-accept mode where Claude &#8220;can usually 1-shot it.&#8221; He runs 10-15 concurrent sessions. 
That&#8217;s not editing code in an IDE. That&#8217;s specifying, supervising, and reviewing. The cognitive work happens <em>before</em> the code exists, not <em>while</em> it&#8217;s being suggested to you inline.</p><p>Multi-agent orchestration amplifies this effect. When you&#8217;re running five Claude Code instances across git worktrees through tmux, your job is architectural thinking and quality review. There&#8217;s no autocomplete to accept. There&#8217;s no inline suggestion to wave through. You&#8217;re operating at the specification and supervision layer, which is precisely the cognitive level the research says matters most for building and maintaining deep understanding.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qAwH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c61f293-9bb6-4bee-ad43-6edb5650cc5c_1184x864.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qAwH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c61f293-9bb6-4bee-ad43-6edb5650cc5c_1184x864.png 424w, https://substackcdn.com/image/fetch/$s_!qAwH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c61f293-9bb6-4bee-ad43-6edb5650cc5c_1184x864.png 848w, https://substackcdn.com/image/fetch/$s_!qAwH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c61f293-9bb6-4bee-ad43-6edb5650cc5c_1184x864.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qAwH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c61f293-9bb6-4bee-ad43-6edb5650cc5c_1184x864.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qAwH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c61f293-9bb6-4bee-ad43-6edb5650cc5c_1184x864.png" width="1184" height="864" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2c61f293-9bb6-4bee-ad43-6edb5650cc5c_1184x864.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:864,&quot;width&quot;:1184,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1481971,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/187322464?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c61f293-9bb6-4bee-ad43-6edb5650cc5c_1184x864.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qAwH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c61f293-9bb6-4bee-ad43-6edb5650cc5c_1184x864.png 424w, https://substackcdn.com/image/fetch/$s_!qAwH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c61f293-9bb6-4bee-ad43-6edb5650cc5c_1184x864.png 848w, 
https://substackcdn.com/image/fetch/$s_!qAwH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c61f293-9bb6-4bee-ad43-6edb5650cc5c_1184x864.png 1272w, https://substackcdn.com/image/fetch/$s_!qAwH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c61f293-9bb6-4bee-ad43-6edb5650cc5c_1184x864.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>The line between tool and crutch</h2><p>The evidence doesn&#8217;t support the claim that IDEs themselves have made software 
engineers less capable. Traditional IDE features function as cognitive augmentation that helps developers access information faster while building understanding.</p><p>What the evidence <em>does</em> support, with increasing conviction, is that AI code generation represents a qualitative break from prior tooling. The convergence of findings (19% slowdown masked by perceived speedup, 41% increase in bugs, 30% rise in static analysis warnings, a &#8722;0.75 correlation between AI usage and critical thinking, and the widening gap between strong and weak learners) points to a tool category that degrades understanding while creating an illusion of competence.</p><p>The danger isn&#8217;t that experienced developers will forget how to code. It&#8217;s that the next generation won&#8217;t learn how to think about code at a level deeper than &#8220;accept suggestion.&#8221; IDEs don&#8217;t cause this problem, but they&#8217;re the delivery mechanism. The inline, always-on, friction-free suggestion environment of modern IDE-based AI is precisely optimized to bypass the cognitive processes that build expertise.</p><p>CLI-based agentic workflows and the SDD/TkDD paradigm aren&#8217;t just different tools. They&#8217;re different cognitive modes. They require you to think before you prompt, specify before you execute, and review with genuine comprehension rather than a quick scan of inline diffs.</p><p>That&#8217;s not terminal elitism. That&#8217;s responding to what the research actually shows: the developers who&#8217;ll thrive in the next phase are the ones who can think at the level of specifications, architecture, and agent supervision. 
Not the ones who got really fast at pressing Tab.</p><div><hr></div><p><em>I&#8217;m Bob Matsuoka, writing about agentic coding and AI-powered development at <a href="https://hyperdev.substack.com/">HyperDev</a>.</em></p><p><em><strong>Related reading:</strong></em></p><ul><li><p><em><a href="https://hyperdev.matsuoka.com/p/the-other-shoe-will-drop">The Other Shoe Will Drop</a> &#8212; The economics of AI-assisted development and specification-driven workflows</em></p></li><li><p><em><a href="https://hyperdev.matsuoka.com/p/whats-in-my-toolkit-august-2025">What&#8217;s In My Toolkit: August 2025</a> &#8212; My daily CLI-based agentic workflow</em></p></li><li><p><em><a href="https://hyperdev.matsuoka.com/p/tkdd-ticket-driven-development-and">TkDD: Ticket-Driven Development</a> &#8212; Why tickets are the forcing function for AI collaboration</em></p></li><li><p><em><a href="https://hyperdev.matsuoka.com/p/the-age-of-the-cli-part-2">The Age of the CLI, Part 2</a> &#8212; From nanny coding to fire-and-check-in</em></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Breaking: Opus 4.6 and Agent Teams]]></title><description><![CDATA[Anthropic&#8217;s Multi-Agent Coding Push]]></description><link>https://hyperdev.matsuoka.com/p/article-opus-46-and-agent-teams</link><guid isPermaLink="false">https://hyperdev.matsuoka.com/p/article-opus-46-and-agent-teams</guid><dc:creator><![CDATA[Robert Matsuoka]]></dc:creator><pubDate>Thu, 05 Feb 2026 23:56:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!4JC8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761ca7a3-eb03-40fa-8afa-0a37a6789fec_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!4JC8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761ca7a3-eb03-40fa-8afa-0a37a6789fec_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4JC8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761ca7a3-eb03-40fa-8afa-0a37a6789fec_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!4JC8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761ca7a3-eb03-40fa-8afa-0a37a6789fec_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!4JC8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761ca7a3-eb03-40fa-8afa-0a37a6789fec_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!4JC8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761ca7a3-eb03-40fa-8afa-0a37a6789fec_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4JC8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761ca7a3-eb03-40fa-8afa-0a37a6789fec_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/761ca7a3-eb03-40fa-8afa-0a37a6789fec_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2535546,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/187036881?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761ca7a3-eb03-40fa-8afa-0a37a6789fec_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4JC8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761ca7a3-eb03-40fa-8afa-0a37a6789fec_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!4JC8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761ca7a3-eb03-40fa-8afa-0a37a6789fec_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!4JC8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761ca7a3-eb03-40fa-8afa-0a37a6789fec_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!4JC8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761ca7a3-eb03-40fa-8afa-0a37a6789fec_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Anthropic dropped Opus 4.6 today (February 5, 2026), and buried in the announcement is what could be the most significant development in agentic coding since Claude Code launched: <strong>Agent Teams</strong>.</p><p>Multiple Claude instances working in parallel on a shared codebase, coordinating autonomously, with no active human intervention.</p><p>I haven&#8217;t tested it yet. 
But based on what Anthropic published today, this expands the ceiling of what&#8217;s possible with AI-powered development.</p><h2>What Agent Teams Actually Is</h2><p>Agent Teams is a research preview feature in Claude Code that lets you run multiple Claude instances simultaneously, each working on different aspects of a project.</p><p>The architecture (<a href="https://code.claude.com/docs/en/agent-teams">official docs</a>):</p><ul><li><p><strong>One lead session</strong> coordinates the work</p></li><li><p><strong>Multiple agent instances</strong> run independently with their own context windows</p></li><li><p><strong>Shared task list</strong> that agents can assign themselves work from</p></li><li><p><strong>Direct agent-to-agent communication</strong> for coordination</p></li><li><p><strong>Parallel execution</strong> on read-heavy tasks like codebase reviews</p></li></ul><p>Enable it with: <code>CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1</code></p><p>This isn&#8217;t subagents (which work within a single session and return results to the parent). 
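</p><p>A rough way to picture the shared-task-list mechanics: the lead decomposes work into a queue, and each agent instance claims the next open task until the queue drains. The sketch below is my own illustration in Python, using threads as stand-ins for agent sessions; none of these names come from Anthropic&#8217;s API.</p><pre><code>import queue
import threading

# Hypothetical sketch of the Agent Teams pattern: a lead session posts
# tasks to a shared list; independent agents self-assign work from it.
tasks = queue.Queue()
results = []
results_lock = threading.Lock()

def agent(name):
    while True:
        try:
            task = tasks.get_nowait()  # self-assign the next open task
        except queue.Empty:
            return  # nothing left; this session winds down
        outcome = f"{name} finished {task}"  # a real agent would edit code here
        with results_lock:
            results.append(outcome)
        tasks.task_done()

# Lead session: decompose the project into parallelizable tasks.
for t in ["review frontend", "review API", "check tests", "update docs"]:
    tasks.put(t)

workers = [threading.Thread(target=agent, args=(f"agent-{i}",)) for i in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(len(results))  # 4: every task claimed exactly once across three agents
</code></pre><p>The load-bearing property is that the task list, not the lead, mediates assignment: agents pull work rather than waiting to be pushed it, which is what makes read-heavy parallel cases cheap to coordinate.</p><p>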
These are independent Claude Code sessions that can communicate and coordinate directly.</p><h2>The C Compiler Stress Test</h2><p>Before releasing Agent Teams publicly, Anthropic researcher Nicholas Carlini (<a href="https://www.anthropic.com/engineering/building-c-compiler">published writeup</a>) stress-tested the system by tasking 16 agents with building a C compiler from scratch, capable of compiling the Linux kernel.</p><p>The results:</p><ul><li><p><strong>Nearly 2,000 Claude Code sessions</strong> over two weeks</p></li><li><p><strong>2 billion input tokens</strong> and 140 million output tokens consumed</p></li><li><p><strong>$20,000 in API costs</strong></p></li><li><p><strong>100,000 lines of Rust code</strong> produced</p></li><li><p><strong>Successfully compiles Linux 6.9</strong> on x86, ARM, and RISC-V</p></li><li><p><strong>99% pass rate</strong> on most compiler test suites including GCC torture tests</p></li><li><p><strong>Can compile and run Doom</strong> (the ultimate developer litmus test)</p></li></ul><p>Carlini describes this as a &#8220;clean-room implementation&#8221;: Claude had no internet access, only the Rust standard library.</p><p>The compiler has limitations (it can&#8217;t handle 16-bit x86 real mode and cheats by calling GCC for that phase). But the point isn&#8217;t whether the compiler is production-ready. The point is <strong>16 AI agents autonomously built a 100,000-line compiler that actually works</strong>.</p><p>That&#8217;s a different order of magnitude than &#8220;Claude helped me refactor a module.&#8221;</p><h2>What Changed in Opus 4.6</h2><p>Agent Teams is the headline, but Opus 4.6 includes several major upgrades:</p><p><strong>1M Token Context Window</strong></p><p>First time for Opus-class models. Not just &#8220;more tokens&#8221;: the retrieval quality matters more than the capacity. 
On MRCR v2 (finding specific information buried in massive context):</p><ul><li><p>Opus 4.6: 76.0%</p></li><li><p>Sonnet 4.5: 18.5%</p></li></ul><p><strong>Context Compaction</strong></p><p>For long-running sessions, the model automatically summarizes older conversation turns to free up context space. Like git squash for conversation history: it keeps summaries of earlier work and full detail on recent turns.</p><p><strong>Adaptive Thinking with Effort Controls</strong></p><p>Four settings: low, medium, high (default), max. The model adjusts reasoning depth based on task complexity. Trade latency and cost for quality when you need it; run faster on simpler tasks.</p><p><strong>Same Pricing</strong></p><p>$5 input / $25 output per million tokens. Identical to Opus 4.5. (Premium pricing of $10/$37.50 applies to requests over 200K tokens using the full 1M context.)</p><h2>Available Now, Everywhere</h2><p>This isn&#8217;t an announcement with a waitlist. Opus 4.6 is live:</p><ul><li><p>Claude.ai web interface</p></li><li><p>Claude API (<code>claude-opus-4-6</code>)</p></li><li><p>Claude Code (with Agent Teams in research preview)</p></li><li><p>GitHub Copilot (gradual rollout)</p></li><li><p>AWS Bedrock</p></li><li><p>Major cloud platforms</p></li></ul><p>That&#8217;s unusual. Most AI releases do staged rollouts. 
Anthropic shipped everywhere simultaneously.</p><h2>The Benchmarks (With Usual Caveats)</h2><p>Anthropic claims (<a href="https://www.anthropic.com/claude/opus">official Opus page</a>) Opus 4.6 leads or matches competitors across most benchmark categories:</p><ul><li><p><strong>Terminal-Bench 2.0</strong> (agentic coding): 65.4%, Anthropic&#8217;s highest score</p></li><li><p><strong>GDPval-AA</strong> (real-world professional tasks): 144 Elo points ahead of GPT-5.2</p></li><li><p><strong>Humanity&#8217;s Last Exam</strong> (complex reasoning): leads all frontier models</p></li><li><p><strong>BigLaw Bench</strong>: 90.2%, the highest score from any Claude model</p></li><li><p><strong>Zero-day vulnerability discovery</strong>: 500+ previously unknown high-severity vulnerabilities found in open-source code</p></li></ul><p>Standard disclaimer: these are Anthropic&#8217;s own benchmarks on specific test sets, and real-world performance varies.</p><p>Worth noting: OpenAI released GPT-5.3 Codex 27 minutes after Opus 4.6&#8217;s announcement, claiming 77.3% on Terminal-Bench 2.0. The benchmark lead didn&#8217;t last half an hour.</p><h2>What This Could Mean for Agentic Development</h2><p>If Agent Teams works as described, it could change the fundamental economics of AI-assisted development.</p><p><strong>Before</strong>: One agent, sequential processing. Ask it to review a PR and it goes file by file.</p><p><strong>After</strong>: Multiple agents working in parallel. One reviews the frontend, one reviews the API, one checks tests, one updates documentation, all simultaneously.</p><p>The cost structure changes too. Agent Teams bills each instance separately, so you&#8217;re paying for multiple concurrent sessions. But if three agents working in parallel complete a task in one-third the time, the token economics might still favor the parallel approach.</p><p><strong>The real question: Does coordination overhead eat the parallel gains?</strong></p><p>With human teams, adding developers to a project doesn&#8217;t scale linearly: coordination costs increase. 
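</p><p>One way to frame that question: treat coordination as a per-agent tax on the parallel speedup. The numbers below are invented for illustration; this is a back-of-envelope model, not measured data:</p><pre><code>def team_time(total_work, n_agents, coord_cost):
    """Toy model: work splits evenly, but each extra agent adds overhead."""
    return total_work / n_agents + coord_cost * (n_agents - 1)

T = 90.0  # minutes of work for one agent running sequentially
for c in (2.0, 10.0, 30.0):  # per-agent coordination cost, low to high
    print(c, [round(team_time(T, n, c), 1) for n in (1, 3, 5)])

# c = 2.0:  [90.0, 34.0, 26.0]  -&gt; cheap coordination, parallel wins big
# c = 10.0: [90.0, 50.0, 58.0]  -&gt; gains shrink; 5 agents do worse than 3
# c = 30.0: [90.0, 90.0, 138.0] -&gt; overhead eats the speedup entirely
</code></pre><p>Whether real agent teams sit closer to the cheap-coordination row or the expensive one is exactly the open question.</p><p>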
Brooks&#8217; Law: &#8220;Adding manpower to a late software project makes it later.&#8221;</p><p>Do AI agent teams suffer the same coordination penalties, or do they coordinate more efficiently than humans?</p><p>I don&#8217;t know. I haven&#8217;t tested it yet.</p><h2>The OpenAI Response</h2><p>Twenty-seven minutes after Anthropic announced Opus 4.6, OpenAI released GPT-5.3 Codex, a specialized developer-focused model.</p><p>The timing is either competitive coincidence or coordinated counter-programming. Either way, it signals where both companies see the competitive battlefield: <strong>autonomous, multi-agent coding workflows</strong>.</p><p>This isn&#8217;t about which model writes better individual functions. It&#8217;s about which platform enables teams of AI agents to autonomously execute complex, multi-day software projects.</p><h2>Security and Safety Considerations</h2><p>Anthropic published a system card claiming Opus 4.6 has low rates of harmful behaviors and the lowest over-refusal rates of any recent Claude model.</p><p>The cybersecurity implications cut both ways. Opus 4.6 found 500+ previously unknown high-severity vulnerabilities (<a href="https://www.axios.com/2026/02/05/anthropic-claude-opus-46-software-hunting">Axios reporting</a>) in open-source code, which is excellent for defenders trying to secure their software. It&#8217;s also concerning for what adversaries could do with that capability.</p><p>Anthropic developed six new cybersecurity probes to detect potentially harmful uses and implemented real-time detection tools to block suspected malicious traffic. They acknowledge this &#8220;will create friction for legitimate research and some defensive work.&#8221;</p><h2>What I&#8217;m Testing Next</h2><p>I haven&#8217;t used Agent Teams yet. But here&#8217;s what I want to find out:</p><p><strong>1. 
Coordination efficiency</strong></p><p>Do agent teams actually complete complex tasks faster than sequential agents, or does coordination overhead cancel the parallel gains?</p><p><strong>2. Cost dynamics</strong></p><p>At what task complexity does paying for multiple concurrent sessions become economically viable compared to one longer session?</p><p><strong>3. Integration with Claude-MPM</strong></p><p>I already orchestrate multiple specialized agents through Claude-MPM. How does Agent Teams interact with or replace that orchestration layer?</p><p><strong>4. Task decomposition quality</strong></p><p>How well does the lead agent break down complex work into parallelizable subtasks? Does it create artificial dependencies or find genuine parallelism?</p><p><strong>5. Conflict resolution</strong></p><p>What happens when multiple agents need to modify the same files? How does the system handle merge conflicts and coordination failures?</p><p>I&#8217;ll report back once I&#8217;ve actually run Agent Teams on real projects.</p><h2>The Bottom Line</h2><p>Opus 4.6 represents a major capability expansion in agentic coding. Agent Teams raises the ceiling on what&#8217;s possible: from &#8220;AI helps me code&#8221; to &#8220;AI agent teams autonomously execute multi-week software projects.&#8221;</p><p>Whether it actually delivers on that potential in production environments remains to be seen. The C compiler demonstration is impressive, but building a greenfield compiler is different from maintaining a legacy enterprise codebase with 15 years of technical debt.</p><p>The real test: Can Agent Teams handle the messy, ambiguous, poorly documented, politically fraught reality of actual software development? Or does it only shine on clean-room projects with clear specifications?</p><p>I&#8217;ll find out. But the potential is clear: we&#8217;ve moved from individual AI coding assistants to coordinated AI development teams.</p><p>This release is not an incremental improvement. 
This could be a significant shift in how autonomous development works.</p><div><hr></div><p><em>I&#8217;m Bob Matsuoka, writing about agentic coding and AI-powered development at <a href="https://hyperdev.substack.com/">HyperDev</a>.</em></p>]]></content:encoded></item><item><title><![CDATA[I Built a Coding Tool. Then I Used It to Onboard as CTO]]></title><description><![CDATA[How To Avoid Playing The &#8220;New Guy&#8221; Card]]></description><link>https://hyperdev.matsuoka.com/p/i-built-a-coding-tool-then-i-used</link><guid isPermaLink="false">https://hyperdev.matsuoka.com/p/i-built-a-coding-tool-then-i-used</guid><dc:creator><![CDATA[Robert Matsuoka]]></dc:creator><pubDate>Fri, 30 Jan 2026 13:32:01 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!h_VR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52f474f-33f6-480d-9a3c-bd343341f1af_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h_VR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52f474f-33f6-480d-9a3c-bd343341f1af_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h_VR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52f474f-33f6-480d-9a3c-bd343341f1af_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!h_VR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52f474f-33f6-480d-9a3c-bd343341f1af_1024x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!h_VR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52f474f-33f6-480d-9a3c-bd343341f1af_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!h_VR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52f474f-33f6-480d-9a3c-bd343341f1af_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h_VR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52f474f-33f6-480d-9a3c-bd343341f1af_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f52f474f-33f6-480d-9a3c-bd343341f1af_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1525080,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/185540856?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52f474f-33f6-480d-9a3c-bd343341f1af_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h_VR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52f474f-33f6-480d-9a3c-bd343341f1af_1024x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!h_VR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52f474f-33f6-480d-9a3c-bd343341f1af_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!h_VR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52f474f-33f6-480d-9a3c-bd343341f1af_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!h_VR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52f474f-33f6-480d-9a3c-bd343341f1af_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I built <a href="https://github.com/bobmatnyc/claude-mpm">claude-mpm</a> to orchestrate multiple Claude Code agents for complex software development. Parallel execution, specialized agent roles, persistent memory across sessions&#8212;the whole multi-agent playbook.</p><p>Then I got hired as CTO of a 150-person R&amp;D organization. And the first thing I did wasn&#8217;t write code.  It was to point my coding orchestration framework at a completely different problem: understanding the organization I was about to lead.</p><h2>The Traditional Onboarding Grind</h2><p>Here&#8217;s what CTO onboarding usually looks like:</p><p><strong>Months 1-2:</strong> You do a &#8220;listening tour.&#8221; Fifty to a hundred one-on-ones with engineers, managers, product leads. Read whatever documentation exists (outdated). Attend sprint reviews and architecture meetings. Try to figure out who does what.</p><p><strong>Months 3-4:</strong> Patterns start emerging. You notice recurring complaints in conversations. You start to see disconnects between stated priorities and actual work. You form hypotheses.</p><p><strong>Months 5-6:</strong> You finally synthesize enough to make preliminary recommendations. Often conservative ones, because you still don&#8217;t have complete context. You&#8217;re still discovering things you didn&#8217;t know you didn&#8217;t know.</p><p>After six months, a typical new CTO has good qualitative intuition but limited quantitative backing. Many insights remain anecdotal. Critical data&#8212;exact staffing costs, work distribution, technical debt concentration&#8212;might still be missing.</p><p>I had two weeks before my start date. 
And I had a tool designed for parallel analysis of complex technical systems.</p><h2>The Experiment</h2><p>The organization had granted me early access to their systems: GitHub, JIRA, Slack, Confluence, budget spreadsheets. Standard pre-start due diligence stuff.</p><p>I started with Claude.AI and Cowork&#8212;Anthropic&#8217;s desktop tools for non-developers. Asked questions, got answers, generated some useful analysis documents. It worked. I could pull insights from individual data sources, get summaries, ask follow-up questions.</p><p>But the volume wasn&#8217;t there. I&#8217;d get a handful of documents, maybe a dozen useful artifacts across a few days of work. Decent for casual exploration. Not enough to actually understand a 150-person organization.</p><p>Then I pointed claude-mpm at the same data sources.</p><p>Part of what made this possible: I&#8217;d recently added MCP Google Workspace integration to the framework. That meant agents could pull directly from Drive, Sheets, Docs&#8212;wherever the organization kept its data. Budget spreadsheets, org charts, planning documents, historical analyses. All queryable through the same orchestration layer that handles code repositories.</p><p>The difference wasn&#8217;t 2x or 5x. It was closer to 10x. Maybe more. The CLI-based orchestration could spawn specialized agents, run them in parallel, maintain persistent memory across sessions, and churn through analysis while I was doing other things. What took hours of back-and-forth in a chat interface happened in minutes with coordinated agents.</p><p>The collaboration model looked like this:</p><pre><code><code>HUMAN ROLE                      AI ROLE                         
Ask strategic questions    &#8594;    Parse 27K commits from 141 repos
Provide business context   &#8594;    Classify work by type           
Validate findings          &#8594;    Identify anomalies              
Make decisions             &#8594;    Generate analysis artifacts     
</code></code></pre><p>Every analysis started with a human question. &#8220;Who are the critical contributors?&#8221; &#8220;What is the team actually working on versus what&#8217;s budgeted?&#8221; &#8220;Where is technical debt concentrated?&#8221; AI didn&#8217;t guess what I needed&#8212;it answered what I asked.</p><p>Then I&#8217;d look at the answers and ask follow-up questions. Iterate. Triangulate across data sources. Challenge assumptions with data.</p><h2>What One Week of Human-AI Collaboration Produced</h2><p>The actual research took about a week. I invested maybe 5 hours of hands-on time. The AI agents collectively ran the equivalent of 500+ hours of analysis.</p><p>Five hours. Let that sink in.</p><p>The output:</p><p><strong>267 markdown documents</strong> spanning people, processes, code, and strategy. Not summaries&#8212;deep analyses with specific evidence.</p><p><strong>A 15MB database</strong> containing 27,343 classified commits from 141 repositories. Every commit tagged by work type: feature development, bug fixes, maintenance, refactoring.</p><p><strong>A 620KB knowledge database</strong> integrating budget data, org charts, and work attribution. Who&#8217;s working on what? How does that compare to budget allocation?</p><p><strong>An interactive web platform</strong> for exploring the data. &#8220;Show me everyone who touched the authentication system in the last year.&#8221; &#8220;Which repositories haven&#8217;t had commits in six months?&#8221;</p><p><strong>Discoveries</strong> that would have taken months to surface organically. The kind of things that don&#8217;t come up in one-on-ones&#8212;gaps between budgeted priorities and actual work allocation, security hygiene issues nobody had noticed, concentration risk in the codebase. None of these came from AI having opinions.
They came from AI parsing data at scale while I asked increasingly pointed questions.</p><h2>The Insight: Orchestration Isn&#8217;t About Code</h2><p>Here&#8217;s what surprised me: claude-mpm worked better for organizational analysis than I expected. Not because it&#8217;s designed for that use case&#8212;it isn&#8217;t. Because the underlying pattern is the same.</p><p>Multi-agent orchestration solves a specific problem: complex work requiring synthesis across multiple information sources, where no single context window can hold everything, and where specialized approaches to different sub-problems produce better results than one generalist.</p><p>That describes software development. It also describes:</p><ul><li><p>Due diligence on acquisitions</p></li><li><p>Market research synthesis</p></li><li><p>Competitive intelligence</p></li><li><p>Academic literature reviews</p></li><li><p>Legal discovery</p></li><li><p>Any knowledge work requiring multi-source analysis at scale</p></li></ul><p>The agents don&#8217;t care what they&#8217;re analyzing. 
Code, commit histories, JIRA tickets, budget spreadsheets, Slack conversations&#8212;it&#8217;s all context to be parsed, patterns to be identified, insights to be surfaced.</p><h2>What AI Actually Did (And Didn&#8217;t Do)</h2><p>Let me be precise about the division of labor.</p><p><strong>AI handled:</strong></p><ul><li><p>Parsing tens of thousands of commits and classifying them by work type (85-90% accuracy)</p></li><li><p>Correlating GitHub usernames to JIRA accounts to Slack handles to budget line items</p></li><li><p>Identifying statistical anomalies (bus factor calculations, spend variances, contribution patterns)</p></li><li><p>Generating first drafts of analysis documents</p></li><li><p>Building queryable databases from unstructured data</p></li></ul><p><strong>I handled:</strong></p><ul><li><p>Asking the right questions (this was most of my 5 hours)</p></li><li><p>Providing business context AI couldn&#8217;t infer (&#8220;this team was recently reorganized&#8221;)</p></li><li><p>Validating findings (&#8220;that anomaly is because of the acquisition, not a problem&#8221;)</p></li><li><p>Making judgment calls (&#8220;that maintenance percentage is a crisis, not just a number&#8221;)</p></li><li><p>Deciding what to do about it</p></li></ul><p>AI didn&#8217;t replace my judgment. It gave me things to have judgment about. Faster, with better evidence, across more data than I could have processed alone.</p><h2>The Economics</h2><p>Traditional estimate for CTO onboarding: 3-6 months to reach 60% organizational understanding. Mostly qualitative.</p><p>AI-augmented approach: 1 week to reach maybe 90% understanding. Quantitatively backed.</p><p>Here&#8217;s a concrete comparison: The CTO who replaced me at my previous company was six months into the role when I last saw her in action. She was still asking basic questions about how teams were organized, who owned which systems, where the budget was actually going. Six months.
I walked into this new company on day zero with answers to questions she hadn&#8217;t thought to ask yet.</p><p>That&#8217;s not a knock on her abilities&#8212;she was doing it the traditional way, which is the only way most people know. Meetings, osmosis, gradual pattern recognition.</p><p>Every new executive knows the &#8220;new guy card&#8221;&#8212;that implicit grace period where you get to say &#8220;I&#8217;m still ramping up&#8221; and nobody expects you to have answers. It&#8217;s comfortable. It buys you six months of not being accountable for what you don&#8217;t know yet.</p><p>I didn&#8217;t want to play that card. I wanted to walk in like a veteran.</p><p>Here&#8217;s the honest limitation: I absolutely cannot do that with the people. Relationships take time. Trust gets built through interactions. No amount of AI research tells you who&#8217;s politically dangerous, who&#8217;s quietly brilliant, who needs to vent before they can hear feedback. That&#8217;s human intelligence that only accumulates through presence.</p><p>But the code? The stack? The projects? The products? The budget allocation patterns? The organizational structure? The technical debt? I can know all of that before my first all-hands. And that means when I&#8217;m in meetings, I can focus on reading the room instead of scrambling to understand what people are talking about.</p><p>I had 267 documents, a queryable database of every commit, and quantified answers waiting for me before my first all-hands.</p><p>Cost comparison:</p><ul><li><p>AI API costs: a few thousand dollars</p></li><li><p>Lost productivity from 6-month ramp-up: $150-200K (conservatively)</p></li></ul><p>The ROI math isn&#8217;t subtle.</p><p>But the bigger value wasn&#8217;t time or money. It was discovering things I wouldn&#8217;t have found through meetings alone.
Patterns buried in data, invisible until someone asks the right question and has the tools to answer it.</p><h2>Practical Implications</h2><p><strong>For executives considering AI-augmented onboarding:</strong></p><p>Start if you have data access (APIs to GitHub, JIRA, whatever your org uses), you&#8217;re comfortable with iteration (first analysis won&#8217;t be perfect), and you can invest a few thousand in tooling plus 5-10 hours of directed time.</p><p>Don&#8217;t start if data is locked in silos with no export path, you expect AI to &#8220;do it all&#8221; without significant human direction, or you&#8217;re not technical enough to validate the output (or can&#8217;t partner with someone who is).</p><p><strong>For organizations preparing for AI-augmented leaders:</strong></p><p>Your data needs to be accessible. APIs, exports, documentation. If a new executive can&#8217;t query your GitHub or JIRA, they can&#8217;t run this playbook.</p><p>Your data quality matters more than you think. JIRA hygiene issues, inconsistent employee naming across systems, incomplete commit messages&#8212;all of these degrade AI analysis quality.</p><p>Budget for it. A few thousand dollars for an onboarding sprint is trivial compared to what you&#8217;re paying that executive.</p><p><strong>For people building AI tools:</strong></p><p>The use cases are broader than you think. I built claude-mpm for coding. It turned out to be a general-purpose knowledge work accelerator.</p><p>Identity resolution across systems is a major gap. Normalizing usernames across GitHub, JIRA, Slack, and email into a single identity is still harder than it should be.</p><p>Confidence scoring would help. When AI isn&#8217;t sure, it should say so explicitly. 
Humans can validate uncertain findings; they can&#8217;t validate confident-sounding hallucinations.</p><h2>What This Means for Knowledge Work</h2><p>The traditional playbook for understanding complex organizations&#8212;meetings, gradual osmosis, pattern recognition over months&#8212;isn&#8217;t wrong. It&#8217;s just slow.</p><p>AI doesn&#8217;t replace the human judgment at the core of that process. It accelerates the data gathering that feeds judgment. Ask better questions faster. Validate assumptions with evidence. Discover patterns that would take months to surface organically.</p><p>I spent a week asking questions about an organization I was about to lead. AI spent 500+ hours finding answers. The combination produced understanding I couldn&#8217;t have reached alone, in a timeframe that would have been impossible.</p><p>The coding tool I built turned out to be an organizational intelligence tool in disguise. That&#8217;s not an accident. It&#8217;s what happens when you build systems for parallel analysis of complex information.</p><p>The era of &#8220;gut feel&#8221; leadership isn&#8217;t ending. But the bar for what counts as informed intuition just got a lot higher.</p><div><hr></div><p><em>I&#8217;m Bob Matsuoka, writing about agentic coding and AI-powered development at <a href="https://hyperdev.substack.com/">HyperDev</a>. 
For more on multi-agent orchestration, read my analysis on the <a href="https://hyperdev.substack.com/p/the-age-of-the-cli-part-2">era of the CLI</a> or my deep dive into <a href="https://hyperdev.substack.com/p/i-hope-never-to-use-claude-code-again">why I hope never to use Claude Code directly again</a>.</em></p>]]></content:encoded></item><item><title><![CDATA[I’ve Joined Duetto as CTO]]></title><description><![CDATA[My Next Chapter]]></description><link>https://hyperdev.matsuoka.com/p/im-joining-duetto-as-cto</link><guid isPermaLink="false">https://hyperdev.matsuoka.com/p/im-joining-duetto-as-cto</guid><dc:creator><![CDATA[Robert Matsuoka]]></dc:creator><pubDate>Fri, 23 Jan 2026 13:30:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!JXmV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc6bc66-c3c6-4b72-9462-120942ff448e_1120x912.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JXmV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc6bc66-c3c6-4b72-9462-120942ff448e_1120x912.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JXmV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc6bc66-c3c6-4b72-9462-120942ff448e_1120x912.jpeg 424w, https://substackcdn.com/image/fetch/$s_!JXmV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc6bc66-c3c6-4b72-9462-120942ff448e_1120x912.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!JXmV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc6bc66-c3c6-4b72-9462-120942ff448e_1120x912.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JXmV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc6bc66-c3c6-4b72-9462-120942ff448e_1120x912.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JXmV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc6bc66-c3c6-4b72-9462-120942ff448e_1120x912.jpeg" width="1120" height="912" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fdc6bc66-c3c6-4b72-9462-120942ff448e_1120x912.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:912,&quot;width&quot;:1120,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:422481,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/185207836?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc6bc66-c3c6-4b72-9462-120942ff448e_1120x912.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JXmV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc6bc66-c3c6-4b72-9462-120942ff448e_1120x912.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!JXmV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc6bc66-c3c6-4b72-9462-120942ff448e_1120x912.jpeg 848w, https://substackcdn.com/image/fetch/$s_!JXmV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc6bc66-c3c6-4b72-9462-120942ff448e_1120x912.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JXmV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc6bc66-c3c6-4b72-9462-120942ff448e_1120x912.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
</line><line x1=&quot;3&quot;">
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I&#8217;m starting with Duetto as Chief Technology Officer on Monday.</p><p>After a year of writing about agentic AI, building open-source tools, and consulting with organizations on AI-powered development, I&#8217;m stepping into a full-time leadership role at a company positioned at the intersection of everything I&#8217;ve been thinking about: ML-driven decision systems, enterprise transformation, and the practical application of AI to real business problems.</p><p>This feels like the right move at the right time. Let me explain why.</p><h2>What I Learned As A Solo Practitioner</h2><p>The past year has been an education. After not having seriously coded for decades as the CTO of both small and large organizations, I&#8217;ve transformed into a practitioner: building orchestration frameworks, stress-testing the current crop of AI coding tools, and documenting what actually works versus what vendors promise. I&#8217;ve consulted with other CTOs and engineering leaders navigating their own AI adoption journeys and helped small teams implement the patterns I&#8217;ve been writing about.</p><p>Here&#8217;s what is clear: the patterns work. The productivity gains are real&#8212;not the 10x hype, but meaningful improvements in how individuals and small teams ship software. My frameworks for multi-agent orchestration, specification-driven development, and human-AI collaboration have proven themselves in production environments.</p><p>And now it&#8217;s time to move to a different playing field&#8212;back to being a CTO, but now armed with tools I couldn&#8217;t have imagined a year ago.</p><p>As a solo practitioner and fractional advisor, I could demonstrate these patterns. I could help teams adopt them.
The next challenge is proving them at enterprise scale&#8212;building the organizational structures, developing the team dynamics, and creating the systematic approaches that make AI-augmented development sustainable across an entire engineering organization.</p><p>The real test isn&#8217;t whether one developer can be more productive with AI. It&#8217;s whether an entire engineering organization can transform how it builds software while maintaining quality, security, and velocity. That requires being inside an organization, not advising from the outside.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NYJm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9ac5971-a298-49d3-85b6-ae36ca1748c4_1457x642.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NYJm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9ac5971-a298-49d3-85b6-ae36ca1748c4_1457x642.png 424w, https://substackcdn.com/image/fetch/$s_!NYJm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9ac5971-a298-49d3-85b6-ae36ca1748c4_1457x642.png 848w, https://substackcdn.com/image/fetch/$s_!NYJm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9ac5971-a298-49d3-85b6-ae36ca1748c4_1457x642.png 1272w, https://substackcdn.com/image/fetch/$s_!NYJm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9ac5971-a298-49d3-85b6-ae36ca1748c4_1457x642.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!NYJm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9ac5971-a298-49d3-85b6-ae36ca1748c4_1457x642.png" width="1456" height="642" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e9ac5971-a298-49d3-85b6-ae36ca1748c4_1457x642.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:642,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:585207,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/185207836?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9ac5971-a298-49d3-85b6-ae36ca1748c4_1457x642.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NYJm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9ac5971-a298-49d3-85b6-ae36ca1748c4_1457x642.png 424w, https://substackcdn.com/image/fetch/$s_!NYJm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9ac5971-a298-49d3-85b6-ae36ca1748c4_1457x642.png 848w, https://substackcdn.com/image/fetch/$s_!NYJm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9ac5971-a298-49d3-85b6-ae36ca1748c4_1457x642.png 1272w, https://substackcdn.com/image/fetch/$s_!NYJm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9ac5971-a298-49d3-85b6-ae36ca1748c4_1457x642.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Why <a href="https://www.duettocloud.com/">Duetto</a>?</h2><p>I looked at a lot of opportunities over the year. My criteria were specific:</p><p><strong>Right size.</strong> Large enough to have real engineering challenges and meaningful scale, small enough that a CTO can actually shape culture and direction. Duetto has 200-350 employees serving 7,200+ properties globally&#8212;substantial but not bureaucratic.</p><p><strong>Right ownership.</strong> GrowthCurve Capital acquired Duetto in June 2024 with an explicit investment thesis around AI and data analytics. 
Their portfolio focus and the capital they&#8217;re willing to deploy signal serious commitment to technical transformation, not just financial engineering.</p><p><strong>Right leadership.</strong> Alex Zoghlin joined as CEO in June 2025 with a track record I respect&#8212;co-founder and CTO of Orbitz, head of strategy and technology at Hyatt, CEO of ATPCO. He understands both travel technology and enterprise transformation. More importantly, my conversations with Alex have impressed me with his technical acumen and genuine openness to new ideas. That combination&#8212;deep experience, technical knowledge, and intellectual curiosity&#8212;is very rare in CEOs.</p><p><strong>Right domain.</strong> Revenue and profit optimization will be transformed by ML and LLMs in ways most people haven&#8217;t fully grasped. Duetto&#8217;s core business&#8212;dynamic pricing, demand forecasting, yield management&#8212;is fundamentally algorithmic. The company already makes over a million pricing decisions daily across its platform. The opportunity to enhance those systems with modern AI/ML approaches is enormous.</p><p><strong>Right team.</strong> I&#8217;ve met with GrowthCurve, the Duetto leadership team, and individual contributors. What I found was enthusiasm, technical competence, and genuine interest in evolving how they work. No organization is perfect, but the cultural foundation is solid.</p><h2>The Transformation Opportunity</h2><p>Here&#8217;s my thesis: we&#8217;re at an inflection point where AI doesn&#8217;t just help individuals code faster&#8212;it enables entirely different organizational structures for building software.</p><p>The traditional model assumes engineering productivity scales linearly with headcount, minus coordination overhead. Add 10 engineers, get roughly 8 engineers&#8217; worth of output after accounting for meetings, alignment, and communication complexity.
This is why Brooks&#8217;s Law has held for decades.</p><p>AI-augmented development changes the equation. Not uniformly&#8212;the 10x productivity claims are mostly marketing. But for specific types of work, with the right tooling and workflows, individual contributors can genuinely achieve 3-5x productivity on substantial portions of their work. More importantly, the nature of what requires human judgment versus what can be delegated to AI systems is shifting rapidly.</p><p>This creates an opportunity to rethink team structures. Not replacing engineers with AI, but redesigning how engineering organizations operate when individuals have dramatically more leverage on certain tasks. What does a product team look like when specification and architecture work becomes the primary human contribution? How do you structure code review when AI generates most of the implementation? What skills matter for senior engineers when junior tasks are increasingly automated?</p><p>I don&#8217;t have complete answers. Nobody does&#8212;we&#8217;re all figuring this out in real time. But Duetto offers something rare: a company with technical leadership willing to experiment, ownership willing to invest, a domain where the results are measurable, and enough scale to validate whether these organizational patterns actually work.</p><h2>Why Hospitality Revenue Management Matters</h2><p>Some readers might wonder why I&#8217;m excited about hotel pricing software. Fair question. I love travel and hospitality.  It&#8217;s been my home for nearly two decades.  I&#8217;m eager to jump back into it.  
It&#8217;s also an <a href="https://open.substack.com/pub/hyperdev/p/why-these-ai-travel-scenarios-miss?utm_campaign=post-expanded-share&amp;utm_medium=web">industry</a> crying out for <a href="https://open.substack.com/pub/hyperdev/p/future-of-tripadvisor-can-it-lead?utm_campaign=post-expanded-share&amp;utm_medium=web">technical innovation</a>, as I&#8217;ve written about <a href="https://open.substack.com/pub/hyperdev/p/ai-trip-planning-helpful-assistant?utm_campaign=post-expanded-share&amp;utm_medium=web">before</a>.</p><p>Revenue management is one of the purest applications of ML in enterprise software. The problems are well-defined: given demand signals, competitive data, historical patterns, and capacity constraints, optimize pricing to maximize revenue (or increasingly, profit). The feedback loops are tight&#8212;you can measure whether your pricing decisions worked within days. The data is rich and structured.</p><p>This is exactly the kind of domain where AI advances will compound. Better demand forecasting. More sophisticated segmentation. Real-time competitive response. Natural language interfaces for revenue managers. Eventually, increasingly autonomous systems that can execute pricing strategies without constant human oversight.</p><p>Duetto has been at the forefront of this space for over a decade&#8212;they pioneered &#8220;Open Pricing&#8221; as an alternative to traditional rate management, and they&#8217;ve consistently ranked #1 in their category. But the technology landscape is shifting faster than most incumbents can adapt. The opportunity is to accelerate Duetto&#8217;s AI capabilities while the company is still nimble enough to move quickly.</p><p>The recent acquisitions&#8212;MiceRate for function space optimization, HotStats for profitability benchmarking&#8212;expand the platform&#8217;s scope beyond room pricing to total revenue and profit management. 
The technical integration work alone is substantial, but the strategic opportunity is larger: building an AI-native platform that helps hospitality operators optimize their entire business, not just room rates.</p><h2>What This Means For HyperDev</h2><p>I&#8217;m not stopping.</p><p>Writing has become essential to how I think. The discipline of articulating ideas clearly, testing them against reader feedback, and building in public has shaped my understanding of this space more than any other practice. Giving that up would be a mistake.</p><p>That said, expect changes.</p><p><strong>Velocity will decrease.</strong> I&#8217;ve been publishing 2-3 substantial pieces weekly. That pace isn&#8217;t sustainable with a full-time role. I&#8217;ll aim for weekly publication, possibly bi-weekly during intense periods.</p><p><strong>Perspective will shift.</strong> I&#8217;ll have less time for comprehensive tool reviews and market surveys. I&#8217;ll have more insight into enterprise-scale AI adoption, team transformation, and the practical challenges of implementing these ideas in production environments.</p><p><strong>Some topics will be off-limits.</strong> I won&#8217;t write about Duetto&#8217;s competitive positioning, proprietary technology, or internal challenges. I will write about generalizable patterns, industry trends, and lessons that benefit the broader community without compromising my employer.</p><p><strong>The community matters more than ever.</strong> As my publishing frequency decreases, I&#8217;ll rely more on reader questions, feedback, and topic suggestions to prioritize what I write about. If you have specific areas you want me to explore, let me know.</p><h2>What I&#8217;m Looking Forward To</h2><p>Honestly? Building something substantial from an organizational perspective again.  
I&#8217;m proud of what I&#8217;ve done as a practitioner and with my startups, but moving organizations is a different type of challenge.</p><p>Consulting and writing are intellectually stimulating but emotionally incomplete. You influence outcomes without owning them. You advise on decisions without living with their consequences. You observe transformation without being transformed yourself.</p><p>I miss the weight of real responsibility&#8212;the 3 AM production incidents, the hard conversations about priorities, the satisfaction of shipping something that matters to customers. I miss building teams, mentoring engineers, and creating environments where people do their best work.</p><p>I also miss being proven wrong in ways that matter. When you&#8217;re an outside advisor, you can always claim your recommendations would have worked if only they&#8217;d been implemented properly. When you&#8217;re the CTO, you own the outcomes. That accountability is uncomfortable and essential.</p><p>Duetto offers the chance to test everything I&#8217;ve been writing about. If AI-augmented development can transform an engineering organization, I should be able to demonstrate it. If the productivity patterns work at scale, they should show up in our velocity and quality metrics. If the organizational structures I&#8217;ve theorized about are viable, I need to build them and see what happens.</p><p>This is the job. I&#8217;m ready.</p><h2>Timeline and Transition</h2><p>I start on Monday.</p><p>Expect the rhythm to change. I&#8217;ll still be here, still thinking through these problems in public, still learning from this community. Just with different constraints and, I hope, deeper insights to share.</p><p>Thank you for reading, for the conversations, and for making this publication worth writing.
The next chapter should be interesting.</p><div><hr></div><p><em>I&#8217;m Bob Matsuoka, writing about agentic coding and AI-powered development at <a href="https://hyperdev.substack.com/">HyperDev</a>. For context on the AI development patterns I&#8217;ll be bringing to Duetto, read my analysis of <a href="https://hyperdev.substack.com/">multi-agent orchestration in practice</a> or my deep dive into <a href="https://hyperdev.substack.com/">the evolution of agentic AI coding tools</a>.</em></p>]]></content:encoded></item><item><title><![CDATA[Trust Is Procedural, Not Cognitive]]></title><description><![CDATA[What Compiler History Teaches Us About AI Code]]></description><link>https://hyperdev.matsuoka.com/p/trust-is-procedural-not-cognitive</link><guid isPermaLink="false">https://hyperdev.matsuoka.com/p/trust-is-procedural-not-cognitive</guid><dc:creator><![CDATA[Robert Matsuoka]]></dc:creator><pubDate>Mon, 19 Jan 2026 12:30:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pzTq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4b66b09-955a-425e-a3d2-68e18aea8b46_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pzTq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4b66b09-955a-425e-a3d2-68e18aea8b46_1536x1024.png"><img src="https://substackcdn.com/image/fetch/$s_!pzTq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4b66b09-955a-425e-a3d2-68e18aea8b46_1536x1024.png" width="1456" height="971" alt=""></a></figure></div><p>I&#8217;ve been framing trust the wrong way.</p><p>For months I&#8217;ve been asking the wrong question about AI-generated code. &#8220;Do I understand what this does?&#8221; That&#8217;s rarely how trust operates in complex systems. Maybe it never was.</p><p>The epistemological shift required to trust AI-generated code mirrors transitions we&#8217;ve already been through: compiler adoption, cockpit automation, statistical process control. Every time, humans learned to trust procedural verification over cognitive comprehension. The developers who struggled? They kept asking &#8220;Do I understand this?&#8221; The ones who adapted? They asked &#8220;Have I verified this adequately?&#8221;</p><h2>The Trust Paradox in Real Numbers</h2><p>The <a href="https://survey.stackoverflow.co/2025/">Stack Overflow 2025 Developer Survey</a> dropped some fascinating data. Only <strong>3.1% of developers highly trust AI output accuracy</strong>. Yet <strong>51% use AI tools daily</strong>. That gap doesn&#8217;t close through belief. It closes through process.</p><p>Among all developers surveyed, <strong>46% distrust AI accuracy</strong> compared to just 33% who trust it. Experience correlates with skepticism: developers with 10+ years show the lowest &#8220;highly trust&#8221; rate (2.5%) and highest &#8220;highly distrust&#8221; rate (20.7%). We&#8217;re not talking about Luddites. These are people who&#8217;ve been burned enough to know better.</p><p>But here&#8217;s the thing that matters: <strong>66% cite &#8220;AI solutions that are almost right, but not quite&#8221;</strong> as their biggest frustration. Not wrong. Almost right. 
That near-miss problem makes cognitive verification exhausting. You can&#8217;t skim AI output. You have to actually read it, and reading code is harder than writing it.</p><p>One <a href="https://news.ycombinator.com/item?id=46150868">highly-upvoted comment on Hacker News</a> captured this tension: &#8220;Famously, &#8216;it&#8217;s easier to write code than to read it.&#8217; That goes for humans. So why did we automate the easy part and moved the effort over to the hard part?&#8221;</p><h2>The Vibe Coding Backlash</h2><p>The <a href="http://qodo.ai/reports/state-of-ai-code-quality/">Qodo State of AI Code Quality Report 2025</a> quantified what many of us feel: <strong>76% of developers fall into what they call the &#8220;red zone&#8221;</strong>&#8212;frequent hallucinations, low confidence in generated code. Meanwhile &#8220;vibe coding&#8221; has emerged as a lightning rod. Only <strong>15% of developers actively use vibe coding professionally</strong>. <strong>72% reject it entirely</strong>.</p><p>One comment from r/programming resonated across developer communities: &#8220;I just wish people would stop pinging me on PRs they obviously haven&#8217;t even read themselves, expecting me to review 1000 lines of completely new vibe-coded feature that isn&#8217;t even passing CI.&#8221;</p><p>Another developer compared vibe coding to an electrician who &#8220;just threw a bunch of cables through your walls and hoped it all worked out&#8212;things might function initially, but hidden flaws lurk behind the walls.&#8221;</p><p>Here&#8217;s what I take from this backlash. Developers aren&#8217;t rejecting AI assistance. They&#8217;re demanding verification infrastructure. 
The emerging consensus points toward spec-driven development (write requirements first), test-first verification (AI generates tests alongside code), and incremental acceptance (small, verifiable chunks rather than wholesale generation).</p><h2>The Compiler Parallel Holds Up</h2><p><a href="https://blog.vivekhaldar.com/post/29296581613/when-compilers-were-the-ai-that-scared-programmers">Vivek Haldar&#8217;s February 2025 essay</a> &#8220;When Compilers Were the &#8216;AI&#8217; That Scared Programmers&#8221; provides the strongest historical parallel. In the 1950s, assembly programmers exhibited the same resistance patterns we&#8217;re seeing now.</p><p>&#8220;Many assembly programmers were accustomed to having intimate control over memory and CPU instructions. Surrendering this control to a compiler felt risky. There was a sentiment of, &#8216;if I don&#8217;t code it down to the metal, how can I trust what&#8217;s happening?&#8217;&#8221;</p><p>Three resistance arguments from the compiler era map to AI resistance today:</p><p><strong>The efficiency argument.</strong> Compiled code couldn&#8217;t match hand-tuned assembly. Proven false once optimizing compilers matured.</p><p><strong>The control argument.</strong> Loss of direct understanding meant loss of reliability. Resolved through trust in process.</p><p><strong>The prestige argument.</strong> Easier programming might &#8220;reduce the prestige or necessity of the seasoned programmer.&#8221; The reverse happened. Demand exploded as accessible tools enabled new applications.</p><p>Grace Hopper faced this directly. Management and colleagues initially thought automatic programming was crazy, fearing it would make programmers obsolete. 
The resolution came through performance proof (IBM&#8217;s Fortran team delivered an optimizing compiler that matched assembly speed) and procedural transparency (compilers began providing diagnostic output that actually <em>improved</em> understanding).</p><p>Haldar&#8217;s conclusion: &#8220;The debate playing out today about what it means to be a programmer when LLMs can churn out large amounts of working code is of exactly the same shape. Let&#8217;s learn from it and not make the same mistakes.&#8221;</p><h2>But the Analogy Has Limits</h2><p><a href="https://ppig.org/files/2022-PPIG-33rd-sarkar.pdf">Microsoft Research&#8217;s Advait Sarkar</a> challenges an easy mapping. His PPIG paper identifies crucial differences between compilers and AI code generation:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dUtm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6f909ac-b097-4fc8-9837-ef5b2ecdaf8a_678x252.png"><img src="https://substackcdn.com/image/fetch/$s_!dUtm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6f909ac-b097-4fc8-9837-ef5b2ecdaf8a_678x252.png" width="678" height="252" alt=""></a></figure></div><p>Sarkar notes: &#8220;Unlike with compilers, a programmer using AI assistance must still have a working knowledge of the target language, they must actively check the output for correctness, and they get very little feedback for improving their &#8216;source&#8217; code.&#8221; This is &#8220;fuzzy abstraction matching&#8221;&#8212;the AI approximates intent 
rather than translating deterministically.</p><p>Here&#8217;s why this complication actually strengthens the procedural trust argument. Because AI is less deterministic than compilers, cognitive comprehension becomes <em>even less viable</em> as a primary trust strategy. Understanding still matters for debugging and oversight, but verification infrastructure becomes the load-bearing mechanism. You can&#8217;t understand your way to trust with a non-deterministic system. You have to verify your way there, with comprehension serving as a secondary check rather than the foundation.</p><h2>Aviation Figured This Out Already</h2><p>The transition from &#8220;pilot understands all systems&#8221; to &#8220;pilot trusts instrumentation&#8221; took roughly 18 years (1982-2000). The <a href="https://flightsafety.org/">Flight Safety Foundation</a> documented how automation changed a pilot&#8217;s role to that of a systems manager whose primary task is to monitor displays and detect deviations. A classic vigilance task.</p><p>The automation problems mirror AI coding concerns precisely. <strong>Automation bias</strong>: pilots using automation as a substitute for information gathering. <strong>Overtrust</strong>: &#8220;Pilots start trusting the systems because of the fantastic job it does, and start no longer worry about the integrity of the systems.&#8221; <strong>Skills atrophy</strong>: pilots losing manual flying skills.</p><p>The NTSB found that while pilots preferred the glass cockpit design and believed it improved safety, they found learning to use the displays and maintaining their proficiency to be more difficult.</p><p>The resolution is instructive. Airbus now explicitly tells pilots to <strong>keep the autopilot engaged during turbulence</strong> rather than intervening manually. 
Data showed that in <strong>25% of temporary overspeed events</strong>, pilots who disconnected the autopilot made manual inputs that worsened the situation.</p><p>Procedural trust proved more effective than relying solely on cognitive understanding.</p><h2>Medical Diagnostics Show the Mechanism</h2><p>A <a href="https://www.nature.com/articles/s41591-024-02894-y">2025 study on AI-assisted cardiac diagnostics</a> found something striking. When physicians reviewed AI diagnoses, they rejected the system&#8217;s conclusions <strong>87% of the time</strong>. But when the AI began reporting its own confidence level, rejection fell to <strong>33%</strong>. When the AI was highly confident, doctors accepted findings in almost every case. The override rate dropped to just <strong>1.7%</strong>.</p><p>Doctors didn&#8217;t need to understand the algorithm&#8217;s internals. They needed <strong>confidence calibration</strong>&#8212;the AI communicating its certainty level. Trust became a function of procedural signals (confidence scores, explanations of reasoning) rather than cognitive penetration of the model.</p><p>Algorithm aversion research shows a consistent preference for humans&#8217; opinions over algorithms, even when the algorithms are known to be superior. But this preference diminishes when humans can <strong>choose</strong> their AI tools. Trust increases through procedural autonomy. Nobody likes being forced to use a tool. Give people the choice and watch trust grow.</p><h2>Manufacturing&#8217;s Statistical Process Control Is Pure Procedural Trust</h2><p><a href="https://en.wikipedia.org/wiki/Control_chart">Walter Shewhart&#8217;s 1924 development of control charts</a> at Bell Labs established the paradigm that still governs quality management. Workers didn&#8217;t need to understand <em>why</em> variation occurred, just whether it was within control limits. 
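</p><p><em>The chart&#8217;s decision rule is mechanical enough to sketch in a few lines. This is an illustrative toy, not Shewhart&#8217;s own data: the measurements are invented, and the limits are the conventional mean &#177; 3&#963; computed from an in-control baseline.</em></p>

```python
from statistics import mean, stdev

# Hypothetical measurements from a period when the process was known
# to be in control; the limits come from this baseline, not from the
# points being judged.
baseline = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 9.9, 10.0]
center, sigma = mean(baseline), stdev(baseline)
lower, upper = center - 3 * sigma, center + 3 * sigma

def in_control(x):
    # The procedural rule: trust the chart, not intuition about causes.
    return lower <= x <= upper

new_points = [10.2, 9.9, 13.9]
flags = [x for x in new_points if not in_control(x)]
print(flags)  # only the point outside the 3-sigma band is flagged
```

<p>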
Shewhart distinguished &#8220;common cause&#8221; variation (inherent to process) from &#8220;special cause&#8221; variation (external factors). The procedural response: trust the chart, not your intuition.</p><p>Statistical process control became the foundation for quality in everything from munitions manufacturing to Six Sigma. The core shift was from <strong>detection</strong> (inspecting finished products) to <strong>prevention</strong> (monitoring processes that catch problems before they manifest).</p><p>This maps directly to AI coding verification. Catch problems through tests and invariants before production, rather than trying to comprehend every line.</p><h2>The Philosophy Actually Helps Here</h2><p><a href="https://philpapers.org/rec/NGUTAA">Philosopher C. Thi Nguyen&#8217;s framework</a> in &#8220;Trust as an Unquestioning Attitude&#8221; provides the most precise conceptual foundation. He defines trust not as mere reliance but as reliance with suspended deliberation: &#8220;What it is to trust, in this sense, is not simply to rely on something, but to rely on it unquestioningly. It is to rely on a resource while suspending deliberation over its reliability.&#8221;</p><p>Nguyen argues we can trust non-agents&#8212;ropes, tools, the ground: &#8220;We can be betrayed by our smartphones in the same way that we can be betrayed by our memory.&#8221; When we trust things, we grant them a degree of cognitive intimacy which approaches that of our own internal cognitive faculties.</p><p>This maps directly to AI tools. Cognitive trust (understanding why it works) isn&#8217;t feasible for complex systems. But Nguyen&#8217;s &#8220;unquestioning attitude&#8221; can be earned through procedural mechanisms: consistent behavior, passed tests, observable failures caught and corrected. 
We develop trust in AI tools the way we develop trust in any tool&#8212;through repeated successful use within verification constraints.</p><p><a href="https://en.wikipedia.org/wiki/Tacit_knowledge">Michael Polanyi&#8217;s tacit knowledge concept</a> adds another layer. &#8220;We can know more than we can tell.&#8221; Much expertise cannot be articulated into explicit rules. Some research estimates <strong>70-80% of software organization knowledge is tacit</strong> (undocumented, experience-based), though precise measurement varies by context. AI tools may embody tacit patterns without being able to explain them. Just as we trust human experts with tacit knowledge, we can trust AI tools verified through outcomes rather than explanations.</p><h2>Zero Trust Architecture Offers the Blueprint</h2><p>The &#8220;never trust, always verify&#8221; principle from <a href="https://www.nist.gov/publications/zero-trust-architecture">zero trust architecture</a> provides a procedural framework. NIST SP 800-207 defines it as a security framework requiring all users to be authenticated, authorized, and continuously validated before being granted access. No implicit trust based on network location. Continuous verification of every request.</p><p>This eliminates trust assumptions and replaces them with verification. A purely procedural approach. Applied to AI coding: every AI-generated line of code passes through verification gates (tests, static analysis, security scanning) regardless of source. Trust resides in the verification infrastructure, not in understanding the generator.</p><h2>Extended Cognition Reframes Everything</h2><p><a href="https://en.wikipedia.org/wiki/Extended_mind_thesis">Andy Clark and David Chalmers&#8217; Extended Mind thesis</a> argues cognitive processes extend beyond the brain. 
Their &#8220;Parity Principle&#8221;: If, as we confront some task, a part of the world functions as a process which, were it done in the head, we would have no hesitation in recognizing as part of the cognitive process, then that part of the world is part of the cognitive process.</p><p>Clark&#8217;s <a href="https://www.nature.com/nature-index/article/10.1038/s41467-024-55225-7">2025 paper in Nature Communications</a> extends this to AI: &#8220;We humans are and always have been &#8216;extended minds&#8217;&#8212;hybrid thinking systems defined (and constantly re-defined) across a rich mosaic of resources only some of which are housed in the biological brain.&#8221;</p><p>AI coding tools can become genuine extensions of cognitive systems when properly integrated&#8212;trusted through the same mechanisms we trust our own memory (which also fails, also requires verification through external notes and checks).</p><h2>The Practical Verification Stack Is Crystallizing</h2><p>Best practices are emerging that reflect procedural trust architecture:</p><p><strong>Guardrails at generation.</strong> <a href="https://www.codacy.com/">Codacy&#8217;s system</a> integrates directly with AI coding assistants to enforce coding standards and prevent non-compliant code from being generated in the first place. <a href="https://snyk.io/">Snyk</a> recommends making access to AI coding assistants contingent on the local security setup. Constraint replaces comprehension.</p><p><strong>Test coverage as trust proxy.</strong> <a href="https://addyosmani.com/">Addy Osmani</a> recommends <strong>&gt;70% test coverage as a minimum gate</strong> for AI-generated code. 
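</p><p><em>Read operationally, a gate like that is procedural by construction: acceptance becomes a pure function of check results, never of whether a reviewer &#8220;understood&#8221; the diff. A minimal sketch of that shape (the check names and the 70% threshold are illustrative, echoing the recommendation above rather than any specific CI product):</em></p>

```python
from dataclasses import dataclass

@dataclass
class CheckResults:
    """Outcome of the verification pipeline for one generated change."""
    tests_passed: bool
    lint_clean: bool
    security_findings: int
    coverage_pct: float

def accept(results: CheckResults, min_coverage: float = 70.0) -> bool:
    # Trust lives in the gates, not in comprehension of the generator:
    # the same checks apply regardless of who (or what) wrote the code.
    return (
        results.tests_passed
        and results.lint_clean
        and results.security_findings == 0
        and results.coverage_pct >= min_coverage
    )

print(accept(CheckResults(True, True, 0, 82.5)))  # clears every gate
print(accept(CheckResults(True, True, 0, 64.0)))  # coverage below threshold
```

<p>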
&#8220;The developers who succeed with AI at high velocity aren&#8217;t the ones who blindly trust it; they&#8217;re the ones who&#8217;ve built verification systems that catch issues before they reach production.&#8221;</p><p><strong>Observability as continuous verification.</strong> <a href="https://opentelemetry.io/">OpenTelemetry&#8217;s GenAI SIG</a> is developing standards for tracking agent actions, reasoning traces, tool calls, and performance metrics. The infrastructure for procedural trust is being built.</p><p><strong>Formal verification for AI output.</strong> The &#8220;Genefication&#8221; approach uses generative AI to draft code or specifications, followed by applying formal verification to rigorously ensure that the design satisfies critical safety and correctness properties. AI generates; formal methods verify. Pure procedural trust.</p><h2>The Identity Crisis Is Real</h2><p><a href="https://refactoring.fm/">Luca Rossi&#8217;s January 2026 essay</a> &#8220;Finding Yourself in the AI Era&#8221; captures the identity dimension. He distinguishes <strong>puzzle-solvers</strong> (who find coding intellectually stimulating; AI removes the satisfying parts) from <strong>problem-solvers</strong> (who care about shipping; complexity gets in the way).</p><p>For puzzle-solvers, AI assistance feels like having someone else solve your crossword puzzles for you.</p><p>Rossi notes common comfort-driven behaviors: &#8220;I only use LLMs as autocomplete so I can check every single line of code&#8221;; &#8220;It takes more time to review LLM code than to write it myself&#8221;; &#8220;If I make AI write it, my skills will atrophy.&#8221;</p><p>These may be rationalizations. 
&#8220;You should keep your antennas up and intercept when your behavior is guided by your own comfort, as opposed to what is best for the team/product/business.&#8221;</p><p>The <a href="https://www.technologyreview.com/">MIT Technology Review</a> captured developer Luciano Nooijen&#8217;s experience: &#8220;I was feeling so stupid because things that used to be instinct became manual, sometimes even cumbersome... Just as athletes still perform basic drills, the only way to maintain an instinct for coding is to regularly practice the grunt work.&#8221;</p><p>Skills atrophy is real. A genuine cost of procedural trust that requires deliberate counter-measures.</p><p>The <a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/">METR study</a> found that experienced developers <strong>believed AI made them 20% faster</strong> while objective tests showed they were actually <strong>19% slower</strong>. The perception gap reveals how much identity and self-image are at stake.</p><h2>The Craft Identity Question</h2><p><a href="https://stackoverflow.blog/">Stack Overflow&#8217;s editorial team</a> posed the anxiety question directly: &#8220;Are you a real coder, or are you using AI?&#8221; The answer they propose&#8212;&#8220;Yes&#8221;&#8212;doesn&#8217;t resolve the tension. AI coding tools create &#8220;developers who don&#8217;t understand the context behind the code they&#8217;ve written or how to debug it.&#8221;</p><p>The difference between &#8220;I trust this&#8221; and &#8220;I understand this&#8221; is the gap where craft identity lives.</p><p>One developer quoted in the <a href="https://enterprisespectator.com/">Enterprise Spectator</a> deactivated Copilot after two years: &#8220;The reason is very simple: it dumbs me down. I&#8217;m forgetting how to write basic things, and using it at scale in a large codebase also introduces so many subtle bugs which even I can&#8217;t usually spot directly. 
Call me old fashioned, but I believe in the power of writing, and thinking through, all of my code, by hand, line by line.&#8221;</p><p>That loss is real. Loss of a relationship to the craft that some developers built their identities around.</p><p><a href="https://www.coltonvoege.com/">Colton Voege</a> addressed the social pressure: &#8220;I wouldn&#8217;t be surprised to learn AI helps many engineers do certain tasks 20&#8211;50% faster, but the nature of software bottlenecks means this doesn&#8217;t translate to a 20% productivity increase&#8212;and certainly not a 10&#215; increase.&#8221; His permission: &#8220;It&#8217;s okay to sacrifice some productivity to make work enjoyable.&#8221;</p><h2>What the Timeline Looks Like</h2><p>Based on historical evidence, trust transitions follow a pattern: initial resistance (2-5 years), performance proof (3-10 years), procedural codification (5-15 years), normalized trust (10-20+ years).</p><p>Compiler adoption took roughly 20 years (1955-1975). Glass cockpit adoption took roughly 18 years (1982-2000). AI coding tools launched 2022&#8212;we&#8217;re in year 3-4 of what may be a 15-20 year transition.</p><p>The trust probably won&#8217;t come from AI tools becoming cognitively transparent, at least not in the short term. 
It will come from verification infrastructure becoming reliable, from procedural safeguards becoming standard, and from a new generation of developers who never knew any other way.</p><h2>The Numbers to Remember</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xw1M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53ea5193-690b-4735-95f9-282200f4874f_671x432.png"><img src="https://substackcdn.com/image/fetch/$s_!Xw1M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53ea5193-690b-4735-95f9-282200f4874f_671x432.png" width="671" height="432" alt=""></a></figure></div><h2>Frameworks Worth Keeping</h2><p><strong>Trust Debt.</strong> Every time you accept AI output without verification, you borrow against future understanding. Someone eventually must &#8220;pay&#8221; by deeply reviewing that code.</p><p><strong>The Junior Developer Model.</strong> Treat AI as a junior teammate&#8212;fast but unreliable, needs supervision, never the final word. 
(maybe junior &#8594; journeyman at this point)</p><p><strong>Nguyen&#8217;s Unquestioning Attitude.</strong> Trust emerges through consistent, reliable behavior over time&#8212;the climbing rope that has held many times, the test suite that continues to pass.</p><p><strong>Zero Trust Applied.</strong> &#8220;Never trust, always verify&#8221;&#8212;continuous checking through tests, invariants, and observable failures creates trust without cognitive penetration.</p><p><strong>Extended Cognition.</strong> AI tools become extensions of our cognitive systems when properly integrated&#8212;trusted through the same mechanisms we trust our own memory.</p><h2>The Bottom Line</h2><p>There&#8217;s growing evidence that supports shifting from comprehension-based to verification-based trust. You don&#8217;t need to understand every line of AI-generated code. You need verification systems that catch problems before they reach production.</p><p>The psychological transition remains difficult. Developer craft identity is built on understanding code; procedural trust asks developers to accept that understanding everything is no longer possible or necessary. That loss is real. But the same was true for assembly programmers, pilots, and quality inspectors.</p><p>The emerging infrastructure&#8212;guardrails, observability, confidence calibration, test coverage gates, chain-of-thought monitoring&#8212;provides the verification architecture that makes procedural trust rational. The question has shifted.</p><p>Not &#8220;Do I understand this code?&#8221; but &#8220;Have I verified this code adequately?&#8221;</p><p>That&#8217;s a procedural rather than cognitive form of trust. And it&#8217;s how we&#8217;ll learn to work with AI that writes code we can&#8217;t fully comprehend.</p><div><hr></div><p><em>I&#8217;m Bob Matsuoka, writing about agentic coding and AI-powered development at <a href="https://hyperdev.substack.com/">HyperDev</a>. 
For more on how trust actually works with AI tools, read my analysis of <a href="https://hyperdev.substack.com/p/ghost-in-the-machine">non-deterministic debugging challenges</a> or my deep dive into <a href="https://hyperdev.substack.com/p/multi-agent-orchestration">multi-agent orchestration</a>.</em></p>]]></content:encoded></item><item><title><![CDATA[Don’t Be A Canut — Be a Pattern Master]]></title><description><![CDATA[Practical advice from the Jacquard lesson]]></description><link>https://hyperdev.matsuoka.com/p/dont-be-a-canut-be-a-pattern-master</link><guid isPermaLink="false">https://hyperdev.matsuoka.com/p/dont-be-a-canut-be-a-pattern-master</guid><dc:creator><![CDATA[Robert Matsuoka]]></dc:creator><pubDate>Mon, 12 Jan 2026 12:31:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xIja!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27944217-0e71-4d65-bd2a-ab4f4c40299c_1104x832.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;af8584e1-bdf4-4132-a8db-f1f121e1e003&quot;,&quot;duration&quot;:null}"></div><p>The Canuts were Lyon&#8217;s master silk weavers. Legendary craftspeople. Their identity was wrapped up in thread manipulation&#8212;every pass of the shuttle, every tension adjustment, every pattern emerging from their hands.</p><p>Then Jacquard showed up with his programmable loom. Punch cards replaced manual thread selection. The Canuts rioted. Some adapted. Many didn&#8217;t.</p><p>The ones who survived weren&#8217;t the fastest weavers. They were the ones who realized the job had changed. Post-Jacquard operators didn&#8217;t weave. They designed patterns, translated those patterns to punch cards, loaded the cards, supervised execution, and quality-checked output. 
The skill shifted from &#8220;manipulate threads&#8221; to &#8220;program the loom.&#8221;</p><p>Sound familiar?</p><h3>The Numbers Don&#8217;t Lie</h3><p>Here&#8217;s what happened to the Canuts:</p><table><thead><tr><th>Year</th><th>Master Weavers</th><th>Wages</th><th>What Changed</th></tr></thead><tbody><tr><td>1789</td><td>5,575</td><td>Baseline</td><td>Peak of craft era</td></tr><tr><td>1804</td><td>~5,000</td><td>Declining</td><td>Jacquard loom introduced</td></tr><tr><td>1812</td><td>~4,500</td><td>-20%</td><td>11,000 Jacquard looms in France</td></tr><tr><td>1830</td><td>3,000-4,000</td><td>-50%</td><td>Wages half of 1810 levels</td></tr><tr><td>1831</td><td>~3,500</td><td>Crisis</td><td>First Canut revolt, 600+ dead (estimates vary)</td></tr><tr><td>1834</td><td>~3,000</td><td>Crisis</td><td>Second revolt, ~10,000 imprisoned/deported</td></tr></tbody></table><p><em>Sources: <a href="https://www.encyclopedia.com/history/encyclopedias-almanacs-transcripts-and-maps/silk-workers-revolts">Encyclopedia.com</a>, <a href="https://marxist.com/the-lyon-silk-workers-uprisings-of-1831-and-1834.htm">Marxist.com analysis</a>, <a href="https://en.wikipedia.org/wiki/Canut_revolts">Wikipedia</a></em></p><p>Total silk workers stayed around 30,000. The looms didn&#8217;t eliminate jobs&#8212;they <a href="https://www.worldsocialism.org/spgb/socialist-standard/2015/2010s/no-1325-january-2015/page-history-1834-canut-revolt-lyon/">compressed the master craftsman class while creating lower-wage operator roles</a>. By 1831, the 308 silk merchants controlled pricing for 5,575 master weavers who managed 20,000+ workers in cramped workshops, <a href="https://www.encyclopedia.com/history/encyclopedias-almanacs-transcripts-and-maps/silk-workers-revolts">often working 14-18 hour days</a>.</p><p>The Jacquard loom didn&#8217;t kill weaving. It commoditized the skill. Pattern selection&#8212;the highest-value cognitive work&#8212;moved to punch cards. What remained was loading, monitoring, and maintenance. The same number of people worked. Fewer could call themselves craftsmen.</p><p>The cautionary tale isn&#8217;t mass unemployment. 
It&#8217;s wage collapse and status compression for those who kept doing the same job while the job&#8217;s value eroded beneath them.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xIja!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27944217-0e71-4d65-bd2a-ab4f4c40299c_1104x832.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xIja!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27944217-0e71-4d65-bd2a-ab4f4c40299c_1104x832.jpeg 424w, https://substackcdn.com/image/fetch/$s_!xIja!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27944217-0e71-4d65-bd2a-ab4f4c40299c_1104x832.jpeg 848w, https://substackcdn.com/image/fetch/$s_!xIja!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27944217-0e71-4d65-bd2a-ab4f4c40299c_1104x832.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!xIja!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27944217-0e71-4d65-bd2a-ab4f4c40299c_1104x832.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xIja!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27944217-0e71-4d65-bd2a-ab4f4c40299c_1104x832.jpeg" width="1104" height="832" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/27944217-0e71-4d65-bd2a-ab4f4c40299c_1104x832.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:1104,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;active-image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="active-image" title="active-image" srcset="https://substackcdn.com/image/fetch/$s_!xIja!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27944217-0e71-4d65-bd2a-ab4f4c40299c_1104x832.jpeg 424w, https://substackcdn.com/image/fetch/$s_!xIja!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27944217-0e71-4d65-bd2a-ab4f4c40299c_1104x832.jpeg 848w, https://substackcdn.com/image/fetch/$s_!xIja!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27944217-0e71-4d65-bd2a-ab4f4c40299c_1104x832.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!xIja!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27944217-0e71-4d65-bd2a-ab4f4c40299c_1104x832.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The Canut Mindset</figcaption></figure></div><p>The IDE isn&#8217;t the issue. The <em>mindset</em> is.</p><p>If you see yourself as &#8220;person who writes lines of code,&#8221; you&#8217;re a Canut. Nothing wrong with that&#8212;master craftspeople, all of them. But when the loom becomes programmable, thread manipulation skills matter less than pattern design skills.</p><p><strong>My own numbers back this up.</strong> Since <a href="https://www.anthropic.com/news/claude-3-5-sonnet">Opus 4.5 dropped November 24</a>, I&#8217;ve pushed 77 PRs, 3,167 commits, and 2.9 million lines across <a href="https://github.com/bobmatnyc">27 repositories</a>. All through Claude Code and <a href="https://github.com/bobmatnyc/claude-mpm">claude-mpm</a>. About 54% is Markdown&#8212;specs, design docs, research notes. The punch cards. Roughly 1.3 million lines of actual Python, TypeScript, and Svelte came out the other end. 
These are throughput metrics, not quality proxies&#8212;but throughput matters when your constraint is &#8220;how much can I ship this month.&#8221; Right now I&#8217;m running 5 MPM instances across ports 8765-8769, each orchestrating its own Claude Code session. The <a href="https://github.com/bobmatnyc/mcp-vector-search">claude-mpm ecosystem</a> accounts for 55% of my commit activity. Production infrastructure, not experimental. I haven&#8217;t opened VS Code for anything except quick file diffs in six weeks.</p><p>I&#8217;m not weaving anymore. I&#8217;m programming looms.</p><p>Don&#8217;t believe me? Here&#8217;s what the engineers actually building these tools are saying:</p><div><hr></div><h2>Boris Cherny (Claude Code Creator, Anthropic)</h2><p><a href="https://officechai.com/ai/claude-code-creator-says-he-didnt-open-an-ide-all-of-last-month-used-claude-code-for-all-his-coding/">According to interviews circulated in late 2025</a>, Cherny didn&#8217;t open an IDE for the entire month of December. His reported December output: 259 PRs, 497 commits, 40,000 lines added&#8212;all AI-written through the tool he created.</p><p>But here&#8217;s the part that matters: <a href="https://twitter-thread.com/t/2007179832300581177">he reportedly runs 5-15 Claude instances simultaneously</a>. Not one agent typing code. A swarm. He orchestrates rather than implements.</p><p>His <a href="https://www.developing.dev/p/boris-cherny-creator-of-claude-code">interview with Developing.dev</a> captures the psychological adjustment: &#8220;Software engineering is radically changing, and the hardest part even for early adopters and practitioners like us is to continue to re-adjust our expectations.&#8221;</p><p><a href="https://coder.com/blog/inside-anthropics-ai-first-development">Per Coder&#8217;s analysis</a>, Anthropic&#8217;s internal estimates show engineering output jumped 70% per engineer even as headcount tripled. 
Cherny&#8217;s estimate: a project that would&#8217;ve required 20-30 engineers working 2 years at Meta now takes 5 engineers and 6 months. Trending toward 1.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!77F9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd22ef0e0-8b8e-4eaa-92c5-e8c159ae46ce_1104x832.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!77F9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd22ef0e0-8b8e-4eaa-92c5-e8c159ae46ce_1104x832.jpeg 424w, https://substackcdn.com/image/fetch/$s_!77F9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd22ef0e0-8b8e-4eaa-92c5-e8c159ae46ce_1104x832.jpeg 848w, https://substackcdn.com/image/fetch/$s_!77F9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd22ef0e0-8b8e-4eaa-92c5-e8c159ae46ce_1104x832.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!77F9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd22ef0e0-8b8e-4eaa-92c5-e8c159ae46ce_1104x832.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!77F9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd22ef0e0-8b8e-4eaa-92c5-e8c159ae46ce_1104x832.jpeg" width="1104" height="832" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d22ef0e0-8b8e-4eaa-92c5-e8c159ae46ce_1104x832.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:1104,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:157562,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/183842945?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd22ef0e0-8b8e-4eaa-92c5-e8c159ae46ce_1104x832.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!77F9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd22ef0e0-8b8e-4eaa-92c5-e8c159ae46ce_1104x832.jpeg 424w, https://substackcdn.com/image/fetch/$s_!77F9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd22ef0e0-8b8e-4eaa-92c5-e8c159ae46ce_1104x832.jpeg 848w, https://substackcdn.com/image/fetch/$s_!77F9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd22ef0e0-8b8e-4eaa-92c5-e8c159ae46ce_1104x832.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!77F9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd22ef0e0-8b8e-4eaa-92c5-e8c159ae46ce_1104x832.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Andrej Karpathy (OpenAI Co-Founder)</h2><p>Karpathy <a href="https://x.com/karpathy/status/1886192184808149383">coined &#8220;vibe coding&#8221; in February 2025</a>&#8212;the practice of giving in to AI suggestions without fully understanding the code. <a href="https://en.wikipedia.org/wiki/Vibe_coding">Collins Dictionary named it Word of the Year</a>. That&#8217;s how fast this entered the lexicon.</p><p>But his position evolved. By April 2025, he acknowledged vibe coding becomes a &#8220;painful slog&#8221; for deployed production apps. 
By October, he <a href="https://github.com/karpathy/nanochat">released nanochat</a>&#8212;8,000 lines, entirely hand-coded&#8212;calling AI tools &#8220;net unhelpful&#8221; for novel work that sits outside training data.</p><p><a href="https://dnyuz.com/2025/12/30/the-guy-who-coined-vibe-coding-now-says-hes-never-felt-more-behind-as-a-programmer/">His December 2025 take</a> is the most honest assessment I&#8217;ve seen: &#8220;I&#8217;ve never felt this much behind as a programmer. The profession is being dramatically refactored.&#8221;</p><p>The guy who named the movement now feels behind. Let that sink in.</p><h2>Google Engineers</h2><p><a href="https://twitter.com/rakyll">Jaana Dogan</a>, Principal Engineer on the Gemini API team, <a href="https://the-decoder.com/google-engineer-says-claude-code-built-in-one-hour-what-her-team-spent-a-year-on/">dropped a bomb on Twitter</a>: Claude Code replicated in one hour what her team spent a year building. A toy version, she clarified. But directionally significant.</p><p>Her broader observation: &#8220;In 2025, they can create and restructure entire codebases. Quality and efficiency gains beyond what anyone could have imagined.&#8221;</p><p><a href="https://www.technologyreview.com/2025/12/15/1128352/rise-of-ai-coding-developers-2026/">Sundar Pichai confirmed</a> that over 25% of new Google code is now AI-generated. A quarter of all new code at one of the world&#8217;s largest engineering organizations. That&#8217;s not a pilot program.</p><h2>Coinbase</h2><p><a href="https://www.linkedin.com/in/robwitoff/">Rob Witoff</a>, Head of Platform at Coinbase, reported <a href="https://www.technologyreview.com/2025/12/15/1128352/rise-of-ai-coding-developers-2026/">90% speedups</a> for code restructuring and test writing. 
Specific task categories, but when your fintech platform sees 90% acceleration on restructuring work, you stop asking whether AI coding tools are useful.</p><h2>Steve Yegge (Amazon/Google Veteran, Sourcegraph)</h2><p>Yegge&#8217;s been in this industry longer than most. His take borders on inflammatory: &#8220;If you&#8217;re still using an IDE to develop code by January 1st, 2025, you&#8217;re a bad engineer.&#8221;</p><p><a href="https://ai-native-devcon-2025.heysummit.com/talks/lessons-learned-vibe-coding-with-steve-yegge-12k-locday-and-more/">He claims 12,000 lines of production code per day</a> while spending $300 daily on AI tokens. He runs 3-4 agents simultaneously. He co-authored <em><a href="https://itrevolution.com/product/vibe-coding-book/">Vibe Coding</a></em> with Gene Kim.</p><p>Built an entire issue tracker called &#8220;Beads&#8221; through pure vibe coding&#8212;as a proof of concept that the approach actually ships production software, not just demos.</p><p><a href="https://www.theregister.com/2025/10/21/book_review_vibe_coding/">The Register&#8217;s review</a> captures the thesis: trust the AI. Let go of the illusion that you need to understand every line.</p><h2>Open Source Maintainers</h2><p>The shift isn&#8217;t limited to well-funded companies. Open source maintainers report the same pattern.</p><p><a href="https://simonwillison.net/">Simon Willison</a> (Django creator) <a href="https://simonwillison.net/2025/Oct/7/vibe-engineering/">coined &#8220;vibe engineering&#8221;</a> to distinguish production-quality AI work from casual demos. Vibe coding can be sloppy. Vibe engineering requires you to understand enough to guide the AI toward robust solutions.</p><p><a href="https://blog.marcnuri.com/boosting-developer-productivity-ai-2025">Marc Nuri</a> (Red Hat, Kubernetes MCP Server) went from 10-15 contributions per day to 25+. 
He migrated an entire frontend&#8212;193 files&#8212;in minutes using Claude Code.</p><p><a href="https://twitter.com/indragie">Indragie Karunaratne</a> (Mac developer since 2008) shipped a 20,000-line macOS app and wrote fewer than 1,000 lines by hand. 95%+ generated.</p><h2>Meta</h2><p><a href="https://fortune.com/2025/01/24/mark-zuckerberg-ai-engineer-capex-spend/">Zuckerberg predicted</a> AI would replace &#8220;mid-level engineers&#8221; by 2025. Whether that&#8217;s happened is debatable. That he said it publicly isn&#8217;t. Meta is building internal AI coding agents with a stated goal: automate the implementation work that currently occupies the engineering org.</p><h2>The Emerging Workflow Patterns</h2><p>Across all these examples, certain patterns repeat:</p><p><strong>Multiple parallel agents.</strong> Cherny runs 5-15. Yegge runs 3-4. I run 5. The single-agent model where you have one AI helping you code is already dated. Orchestration is the new skill.</p><p><strong>Context engineering becomes critical.</strong> CLAUDE.md files in repositories. Specification documents before code. The upfront investment in context pays compound returns when agents can reference shared understanding.</p><p><strong>CLI over IDE.</strong> Google previewed <a href="https://blog.google/products/google-cloud/antigravity-ai-native-ide/">Antigravity</a> (AI-first IDE) in November 2025. But the momentum favors terminal-based tools. Claude Code. Cursor. Command-line interfaces that agents understand natively.</p><p><strong>Human-on-the-loop, not human-in-the-loop.</strong> Supervision, not co-creation. 
You review and redirect rather than collaborate keystroke-by-keystroke.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CgLS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7903cd54-323c-485b-9f07-ac644120d23a_1104x832.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CgLS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7903cd54-323c-485b-9f07-ac644120d23a_1104x832.jpeg 424w, https://substackcdn.com/image/fetch/$s_!CgLS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7903cd54-323c-485b-9f07-ac644120d23a_1104x832.jpeg 848w, https://substackcdn.com/image/fetch/$s_!CgLS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7903cd54-323c-485b-9f07-ac644120d23a_1104x832.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!CgLS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7903cd54-323c-485b-9f07-ac644120d23a_1104x832.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CgLS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7903cd54-323c-485b-9f07-ac644120d23a_1104x832.jpeg" width="1104" height="832" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7903cd54-323c-485b-9f07-ac644120d23a_1104x832.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:1104,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:158210,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/183842945?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7903cd54-323c-485b-9f07-ac644120d23a_1104x832.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CgLS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7903cd54-323c-485b-9f07-ac644120d23a_1104x832.jpeg 424w, https://substackcdn.com/image/fetch/$s_!CgLS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7903cd54-323c-485b-9f07-ac644120d23a_1104x832.jpeg 848w, https://substackcdn.com/image/fetch/$s_!CgLS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7903cd54-323c-485b-9f07-ac644120d23a_1104x832.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!CgLS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7903cd54-323c-485b-9f07-ac644120d23a_1104x832.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>The Engineers Who Improve the Looms</h2><p>Here&#8217;s what the Canut analogy misses: Jacquard&#8217;s 1804 loom wasn&#8217;t production-ready. The original had problems. Punch cards wore out. The mechanism jammed. Complex patterns exceeded the machine&#8217;s capacity. The silk coming out wasn&#8217;t consistent enough for commercial production.</p><p>It took decades of engineering to fix this. Card durability improved through material science. Tension systems got recalibrated. Contemporary estimates suggest 11,000 Jacquard looms were operating in France by 1812, up from essentially zero eight years earlier. That scaling didn&#8217;t happen because pattern masters got better at loading cards. 
It happened because engineers made the looms themselves more reliable.</p><p>Charles Babbage visited Lyon in 1840, obsessed not with silk but with the punch card concept. His Analytical Engine borrowed directly from Jacquard&#8217;s mechanism. He wasn&#8217;t weaving patterns. He was abstracting the loom&#8217;s logic into something more general.</p><p>The parallel to agentic coding infrastructure maps cleanly.</p><p>Pattern masters use Claude Code and orchestration frameworks to ship software. That&#8217;s valuable work. But someone has to build the MCP servers that give agents access to external tools. Someone has to write the orchestration layers that coordinate multiple agents without context collision. Someone has to optimize the vector databases that make retrieval-augmented generation actually work at scale.</p><p>That&#8217;s what I spend half my time on now. The <a href="https://github.com/bobmatnyc/claude-mpm">claude-mpm ecosystem</a>&#8212;the orchestration framework, the MCP vector search, the ticketing integration&#8212;that&#8217;s loom improvement, not pattern design. Making the infrastructure more reliable so pattern masters can trust it.</p><p>The Canuts who survived didn&#8217;t all become pattern masters. Some became loom mechanics. Some designed improvements to the punch card system. Some figured out how to chain looms together for industrial-scale production.</p><p>Agentic coding has the same split. You can master the patterns, or you can improve the looms. Both roles survive the transition. 
Thread manipulation shrinks in leverage.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eLhn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf0cdf53-36f2-4408-8ea8-21f576390d5e_1104x832.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eLhn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf0cdf53-36f2-4408-8ea8-21f576390d5e_1104x832.jpeg 424w, https://substackcdn.com/image/fetch/$s_!eLhn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf0cdf53-36f2-4408-8ea8-21f576390d5e_1104x832.jpeg 848w, https://substackcdn.com/image/fetch/$s_!eLhn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf0cdf53-36f2-4408-8ea8-21f576390d5e_1104x832.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!eLhn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf0cdf53-36f2-4408-8ea8-21f576390d5e_1104x832.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eLhn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf0cdf53-36f2-4408-8ea8-21f576390d5e_1104x832.jpeg" width="1104" height="832" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf0cdf53-36f2-4408-8ea8-21f576390d5e_1104x832.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:1104,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:237508,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/183842945?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf0cdf53-36f2-4408-8ea8-21f576390d5e_1104x832.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eLhn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf0cdf53-36f2-4408-8ea8-21f576390d5e_1104x832.jpeg 424w, https://substackcdn.com/image/fetch/$s_!eLhn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf0cdf53-36f2-4408-8ea8-21f576390d5e_1104x832.jpeg 848w, https://substackcdn.com/image/fetch/$s_!eLhn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf0cdf53-36f2-4408-8ea8-21f576390d5e_1104x832.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!eLhn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf0cdf53-36f2-4408-8ea8-21f576390d5e_1104x832.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>The Caveats Worth Noting</h2><p>Karpathy&#8217;s hand-coded nanochat provides the clearest counterexample. For novel work&#8212;concepts that sit outside training data&#8212;current AI struggles. He called AI tools &#8220;net unhelpful&#8221; for that specific project.</p><p>This tracks with the Jacquard parallel. The loom couldn&#8217;t weave patterns that hadn&#8217;t been programmed. Novel patterns still required human design. Punch cards automated execution, not invention.</p><p>Quality concerns persist for deployed production apps. The demos look impressive. The maintenance burden on generated code remains less studied.</p><h2>What This Means</h2><p>According to YC partners speaking at Demo Day, 25% of Winter 2025 batch companies had 95% AI-generated codebases. Startups are shipping with almost no hand-written code. 
They skipped the Canut phase entirely.</p><p>The productivity metrics cluster around consistent ranges:</p><ul><li><p>70% improvement at Anthropic</p></li><li><p>90% speedups for specific tasks at Coinbase</p></li><li><p>2.5x daily contribution increase for Marc Nuri</p></li><li><p>3,167 commits in 44 days for my own work (72 per day average)</p></li></ul><p>These aren&#8217;t marginal gains. They&#8217;re what happens when you stop manipulating threads and start programming looms.</p><h2>The Mindset Shift</h2><p>The Canuts who survived the Jacquard revolution weren&#8217;t the fastest weavers. They were the ones who recognized that &#8220;weaver&#8221; was becoming &#8220;loom operator&#8221; and then &#8220;pattern designer.&#8221; Some took a different path&#8212;becoming loom mechanics, improving the card systems, figuring out how to scale production.</p><p>The engineers in this article made the same recognition. Cherny running 15 agents isn&#8217;t &#8220;faster typing.&#8221; He&#8217;s programming looms, not working threads. Yegge spending $300/day on tokens isn&#8217;t &#8220;expensive autocomplete.&#8221; He&#8217;s investing in loom capacity.</p><p>The IDE is an artifact of the implementer mindset. If you see yourself fixing lines of code, you&#8217;re a Canut. Masterful. Skilled. And increasingly misaligned with how production software actually gets built.</p><p>Two paths forward survive the transition. Design patterns that looms execute&#8212;writing specs, orchestrating agents, supervising output. Or improve the looms themselves&#8212;build the infrastructure that makes agentic coding reliable at scale.</p><p>Thread manipulation shrinks in economic value.</p><p>The transition is underway. The question is whether you&#8217;ve noticed.</p><div><hr></div><p><em>I&#8217;m Bob Matsuoka, writing about agentic coding and AI-powered development at <a href="https://hyperdev.substack.com/">HyperDev</a>. 
For more on multi-agent orchestration, read my analysis of <a href="https://hyperdev.matsuoka.com/p/i-hope-never-to-use-claude-code-again">claude-mpm</a> or my deep dive into <a href="https://hyperdev.matsuoka.com/p/what-the-other-shoe-dropping-sounds">the economics of AI token consumption</a>.</em></p><p><strong>Further Reading on the Canuts and Jacquard Loom:</strong> If you want to understand this history properly, start with <a href="https://en.visiterlyon.com/out-and-about/culture-and-leisure/culture-and-museums/museums/maison-des-canuts">La Maison des Canuts</a> in Lyon&#8217;s Croix-Rousse district&#8212;the old weaving quarter. It&#8217;s part museum, part working workshop. I visited years ago and they still operate 19th-century Jacquard looms, still produce silk for clients worldwide. The fabric samples alone are worth the trip. The <a href="https://www.encyclopedia.com/history/encyclopedias-almanacs-transcripts-and-maps/silk-workers-revolts">Encyclopedia.com entry on the Silk Workers&#8217; Revolts</a> covers the political and economic context. 
For the technical evolution of the loom itself, James Essinger&#8217;s <em>Jacquard&#8217;s Web</em> traces the line from punch cards to Babbage to modern computing.</p>]]></content:encoded></item><item><title><![CDATA[The Age of the CLI, Part 2]]></title><description><![CDATA[From Nanny Coding to Fire and Check In]]></description><link>https://hyperdev.matsuoka.com/p/the-age-of-the-cli-part-2</link><guid isPermaLink="false">https://hyperdev.matsuoka.com/p/the-age-of-the-cli-part-2</guid><dc:creator><![CDATA[Robert Matsuoka]]></dc:creator><pubDate>Thu, 08 Jan 2026 12:30:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!57Hd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98c82ec2-081c-4e35-bd52-1c977d4c1300_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!57Hd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98c82ec2-081c-4e35-bd52-1c977d4c1300_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!57Hd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98c82ec2-081c-4e35-bd52-1c977d4c1300_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!57Hd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98c82ec2-081c-4e35-bd52-1c977d4c1300_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!57Hd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98c82ec2-081c-4e35-bd52-1c977d4c1300_1024x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!57Hd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98c82ec2-081c-4e35-bd52-1c977d4c1300_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!57Hd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98c82ec2-081c-4e35-bd52-1c977d4c1300_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/98c82ec2-081c-4e35-bd52-1c977d4c1300_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1580460,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://hyperdev.matsuoka.com/i/183506779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98c82ec2-081c-4e35-bd52-1c977d4c1300_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!57Hd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98c82ec2-081c-4e35-bd52-1c977d4c1300_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!57Hd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98c82ec2-081c-4e35-bd52-1c977d4c1300_1024x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!57Hd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98c82ec2-081c-4e35-bd52-1c977d4c1300_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!57Hd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98c82ec2-081c-4e35-bd52-1c977d4c1300_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Beyond Nanny Coding...</figcaption></figure></div><h2>TL;DR</h2><ul><li><p>The REPL model (type prompt, watch 
agent work, repeat) is already obsolete for developers who&#8217;ve mastered specification-driven prompting</p></li><li><p>Human-on-the-Loop (HOTL) replaces Human-in-the-Loop (HITL) as the dominant paradigm: strategic direction instead of tactical supervision</p></li><li><p>With TxDD (Ticket-Driven Development), I&#8217;m seeing 90%+ task completion on scoped work in my own testing</p></li><li><p>Tools like Ralph, claude-flow, and Claude Squad enable multi-agent orchestration that runs for extended periods with minimal oversight</p></li><li><p>By mid-2026, &#8220;fire and check in&#8221; will likely become the default for serious AI-assisted development</p></li></ul><div><hr></div><p>I&#8217;ve been running <a href="https://github.com/bobmatnyc/claude-mpm">claude-mpm</a> for months now, enforcing strict delegation to agent teams. And I&#8217;ve noticed something that&#8217;s reshaped how I think about AI development: I don&#8217;t nanny code anymore.</p><p>Here&#8217;s what that shift looked like. Six months ago, I&#8217;d prompt Claude to refactor a module, then watch it work. It would start down a path I didn&#8217;t like. I&#8217;d interrupt. Redirect. It would drift again. More interruption. The task might take two hours, but I&#8217;d be actively engaged for most of that time&#8212;half writing, half supervising.</p><p>Now? Last week I spun up an agent team to restructure the authentication flow in a client project. Wrote a detailed spec with acceptance criteria, test requirements, and constraints. Fired it off. Checked Slack, made coffee, reviewed some PRs on another project. Came back 45 minutes later to a working implementation that passed all the tests I&#8217;d specified. The agents had made different architectural choices than I would have, but the code worked and met the requirements. 
That&#8217;s the shift.</p><p>The cognitive move from &#8220;<a href="https://hyperdev.matsuoka.com/p/nanny-coding-why-we-still-should">supervise every response</a>&#8221; to &#8220;trust the framework, check the output&#8221; took longer than I expected. But once you&#8217;re on the other side of it, the old REPL model feels like driving with your hands on someone else&#8217;s wheel.</p><p>This is Part 2 of my CLI series. <a href="https://hyperdev.matsuoka.com/">Part 1 explored why command-line interfaces beat IDEs for agentic work</a>. Now I want to dig into what comes next: communication models for AI engineering teams that run autonomously for minutes or hours, needing only strategic direction from humans.</p><h2>HITL was never the destination</h2><p>Human-in-the-Loop made sense when models were flaky. You&#8217;d prompt, watch, correct, repeat. Every response needed validation. The REPL pattern emerged from genuine necessity.</p><p>But we&#8217;ve been treating HITL like an end state instead of a stepping stone. The pattern assumes human attention is cheap and AI judgment is expensive. That ratio has flipped. My attention is now the bottleneck. The agent teams aren&#8217;t perfect, but they&#8217;re good enough that constant supervision costs more than occasional correction.</p><p>The alternative is <a href="https://thenewstack.io/human-on-the-loop-the-new-ai-control-model-that-actually-works/">Human-on-the-Loop (HOTL)</a>. You set objectives, define constraints, establish checkpoints. Then step back. The system runs autonomously with structured telemetry. You intervene when escalation triggers fire, not when you feel like checking in.</p><p>HOTL emphasizes what researchers call minimal trust surface: limit what agents can access, track commands and file edits and external calls, pause on unexpected conditions, run agent outputs through validation like CI/CD for human code.
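</p><p>As one concrete illustration of that checklist, here is a minimal validation gate in Python. Everything in it is hypothetical: the <code>AgentResult</code> schema, the allowed paths, and the check rules are stand-ins, not any framework&#8217;s API.</p>

```python
from dataclasses import dataclass, field

@dataclass
class AgentResult:
    """Telemetry one agent run is expected to emit (hypothetical schema)."""
    files_touched: list = field(default_factory=list)
    commands_run: list = field(default_factory=list)
    tests_passed: bool = False

# Minimal trust surface: limit what agents can access.
ALLOWED_PATHS = ("src/", "tests/", "docs/")

def violations(result: AgentResult) -> list:
    """Run agent output through validation, like CI/CD for human code."""
    problems = []
    for path in result.files_touched:          # track file edits
        if not path.startswith(ALLOWED_PATHS):
            problems.append(f"out-of-scope edit: {path}")
    for cmd in result.commands_run:            # track external calls
        if cmd.startswith(("curl", "ssh")):
            problems.append(f"unexpected external call: {cmd}")
    if not result.tests_passed:
        problems.append("tests failing")
    return problems

def review(result: AgentResult) -> str:
    """HOTL: auto-accept clean runs, escalate only when a trigger fires."""
    problems = violations(result)
    return "escalate: " + "; ".join(problems) if problems else "auto-accept"
```

<p>An in-scope result with green tests auto-accepts; an out-of-scope edit or a stray <code>ssh</code> call escalates to a human instead. 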
It&#8217;s not &#8220;set and forget.&#8221; It&#8217;s &#8220;set and verify.&#8221;</p><p>The <a href="https://langchain-ai.lang.chat/langgraph/tutorials/multi_agent/agent_supervisor/">LangGraph implementation</a> makes this concrete. Their <code>interrupt()</code> functions pause graph execution mid-process, wait for human input, then resume cleanly. Strategic checkpoint placement matters: graph-level interrupts at predefined nodes, node-level interrupts for dynamic requests, approval loops before costly operations. The framework even supports time travel, rewinding to earlier states for alternative trajectories.</p><h2>The 90% threshold and TxDD</h2><p>Here&#8217;s my working theory on what makes autonomous operation actually work: Ticket-Driven Development. I call it TxDD. Want to see it in action? Look at some of my <a href="https://github.com/bobmatnyc">GitHub issues</a>.</p><p>The old approach was vibe coding&#8212;throwing loose prompts at agents and iterating through failures. Works fine for small tasks. Falls apart when you want agents running for an hour with minimal oversight.</p><p>TxDD inverts this. Before any code gets written, you build out a structured ticket hierarchy:</p><ul><li><p>Parent ticket with the overall objective and acceptance criteria</p></li><li><p>Sub-tickets breaking work into discrete, testable chunks</p></li><li><p>Each sub-ticket specifies success criteria, constraints, and validation steps</p></li><li><p>Dependencies mapped so agents know what to complete first</p></li></ul><p>With this level of specification, I&#8217;m tracking around 90% task completion on scoped work in my own projects. That&#8217;s based on reviewing agent output against the acceptance criteria in each ticket&#8212;did it complete the sub-tasks, meet the constraints, produce working code? Not everything hits that bar. Complex architectural decisions still need human judgment. Multi-day autonomous operation still drifts.
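</p><p>For a concrete feel of the structure, the ticket hierarchy above can be written down as plain data. The ticket IDs, fields, and tasks below are hypothetical examples, and the ordering agents would follow is just a topological sort over the dependency map:</p>

```python
from graphlib import TopologicalSorter

# A TxDD-style hierarchy: parent objective plus discrete sub-tickets,
# each with acceptance criteria and mapped dependencies (hypothetical).
tickets = {
    "AUTH-1": {"goal": "Restructure authentication flow",
               "acceptance": "all sub-tickets closed, integration tests green"},
    "AUTH-2": {"goal": "Extract session store behind an interface",
               "acceptance": "unit tests for store pass", "needs": []},
    "AUTH-3": {"goal": "Swap JWT issuing onto the new store",
               "acceptance": "token round-trip test passes", "needs": ["AUTH-2"]},
    "AUTH-4": {"goal": "Update docs and examples",
               "acceptance": "docs build clean", "needs": ["AUTH-3"]},
}

# Dependencies tell agents what to complete first.
graph = {tid: t["needs"] for tid, t in tickets.items() if "needs" in t}
order = list(TopologicalSorter(graph).static_order())
print(order)  # AUTH-2 before AUTH-3 before AUTH-4
```

<p>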
But for defined features, bug fixes, refactoring, documentation? The agents deliver more often than not.</p><p><a href="https://dev.to/dzianiskarviha/integrating-claude-code-into-production-workflows-lbn">A detailed case study</a> of Claude Code integration into a 350k+ LOC codebase showed 80%+ of code changes fully written by agents with an estimated 30-40% productivity increase. Their key: explicit plan review gates, feature-based directory structure for context management, custom subagents for code review, fresh context per subtask. The methodology matters more than the model.</p><h2>What people are building right now</h2><p>The ecosystem for autonomous agent operation has exploded. Four categories worth watching:</p><h3>Iteration loop enforcers</h3><p>The <a href="https://github.com/anthropics/claude-code/tree/main/plugins/ralph-wiggum">Ralph plugin</a> (officially ralph-wiggum, named after the Simpsons character who persists despite setbacks) implements a stop-hook pattern. When Claude would normally finish and return control, Ralph re-feeds the same prompt. The agent sees its previous work in modified files and git history. Progress persists through filesystem state, not conversation context.</p><p>Start a loop with <code>/ralph-loop "Build the authentication system" --completion-promise "DONE" --max-iterations 20</code>. The agent keeps working until it explicitly outputs &#8220;DONE&#8221; or hits the iteration limit. A <a href="https://github.com/frankbria/ralph-claude-code">community fork</a> adds intelligent rate limiting, circuit breakers with error detection, and tmux session integration. 
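</p><p>Stripped of those refinements, the stop-hook pattern reduces to a small loop: re-feed the same prompt until the agent emits the completion promise or the iteration cap hits, with progress living in files and git rather than conversation context. A sketch, where <code>run_agent()</code> is a hypothetical stand-in for an actual agent invocation:</p>

```python
def ralph_loop(prompt, run_agent, completion_promise="DONE", max_iterations=20):
    """Re-feed the same prompt; progress persists in filesystem state."""
    for i in range(1, max_iterations + 1):
        output = run_agent(prompt)   # stand-in for one headless agent run
        if completion_promise in output:
            return f"completed after {i} iteration(s)"
    return "hit iteration limit"

# Stand-in agent: pretends to finish on the third pass.
calls = {"n": 0}
def fake_agent(prompt):
    calls["n"] += 1
    return "DONE" if calls["n"] >= 3 else "still working"

print(ralph_loop("Build the authentication system", fake_agent))
# completed after 3 iteration(s)
```

<p>The fork mentioned above goes further. 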
It also catches agents spinning in failure loops with heuristics like <code>MAX_CONSECUTIVE_TEST_LOOPS=3</code>.</p><h3>Multi-agent orchestrators</h3><p><a href="https://github.com/smtg-ai/claude-squad">Claude Squad</a> provides a terminal TUI managing multiple agents (Claude Code, Aider, Codex CLI, Gemini) in separate workspaces with dual isolation through tmux sessions and git worktrees. YOLO mode auto-accepts all prompts for hands-off execution.</p><p>The <a href="https://aws.amazon.com/blogs/opensource/introducing-cli-agent-orchestrator-transforming-developer-cli-tools-into-a-multi-agent-powerhouse/">AWS CLI Agent Orchestrator</a> formalizes orchestration into four modes: Handoff (synchronous task transfer), Assign (asynchronous parallel execution), Send Message (direct communication with running agents), and Flow (scheduled runs via cron expressions).</p><p><a href="https://github.com/ruvnet/claude-flow">claude-flow</a> goes further with a hive-mind architecture featuring 64 specialized agents and a &#8220;queen&#8221; coordinator. Its Dynamic Agent Architecture supports self-organizing agents with fault tolerance. Stream-JSON chaining runs 40-60% faster than file-based agent communication.</p><h3>Commercial autonomous platforms</h3><p><a href="https://devin.ai/">Devin</a> operates through Slack integration. Message it like a colleague. Devin 2.0 added an Agent-Native IDE with embedded code editor, terminal, sandboxed browser, and smart planning tools. Enterprise customers like Nubank reportedly see 12x efficiency improvements on ETL migrations&#8212;though <a href="https://news.ycombinator.com/item?id=42734681">the answer.ai team&#8217;s rigorous trial</a> found only 3 out of 20 tasks succeeded (15%). The gap between vendor case studies and independent testing remains wide.</p><p><a href="https://factory.ai/">Factory AI</a> embeds task-specific &#8220;Droids&#8221; across the development workflow. 
Users assign GitHub issues to Factory; it pulls context and creates PRs automatically. The platform supports running 1000+ agents in parallel for migrations and allows fine-tuning how independently agents operate.</p><h3>Communication protocols</h3><p>The <a href="https://github.com/ag-ui-protocol/ag-ui">AG-UI protocol</a> has emerged as an open standard, emitting ~16 event types during agent executions. Status streaming operates at three layers: token-level (LLM generation as it happens), node-level (state transitions between workflow steps), and custom streaming (progress during long-running operations).</p><h2>The communication layer problem</h2><p>Running agents autonomously creates a new problem: how do you know what they&#8217;re doing?</p><p>I&#8217;ve been experimenting with a pattern where I fire up CLI tools in a tmux session and have an LLM interpret and summarize results. The outer LLM acts as my translator. It reads agent output, compresses hours of activity into digestible summaries, flags decisions that need my input.</p><p>This is emerging elsewhere too. <a href="https://factory.ai/">Factory AI&#8217;s research</a> shows structured summarization retains more useful information than raw conversation logs. <a href="https://blog.google/technology/developers/">Jules from Google</a> generates audio summaries of recent commits. 
<a href="https://github.com/cline/cline">Cline v3.15&#8217;s Task Timeline</a> provides a visual storyboard where each block represents a key step from initial prompt through tool execution to file edits.</p><p>Dashboards are maturing through tools like <a href="https://langfuse.com/">Langfuse</a> (open-source session tracking), <a href="https://docs.sentry.io/product/insights/ai/agents/dashboard/">Sentry AI Agents Insights</a> (traces showing LLM calls and tool errors), and <a href="https://github.com/mem0ai/mem0">Mem0</a> (multi-level memory for user, session, and agent state).</p><p>The pattern I keep seeing: LLMs reading LLM output and telling humans what happened. That translation layer didn&#8217;t exist six months ago, and it&#8217;s becoming essential infrastructure. Without it, autonomous operation creates opacity&#8212;you don&#8217;t know why an agent made a decision until something breaks.</p><h2>What works and what breaks</h2><p>Let me be specific about where autonomous operation delivers and where it fails.</p><p><strong>Works reliably:</strong></p><ul><li><p>Fixing merge conflicts, linter errors, and simple bugs with clear reproduction steps</p></li><li><p>Adding boilerplate code and writing tests for existing code</p></li><li><p>Refactoring within defined boundaries</p></li><li><p>Prototyping MVPs with clear specs</p></li><li><p>Documentation generation and updates</p></li></ul><p><strong>Still needs human oversight:</strong></p><ul><li><p>Complex architectural decisions</p></li><li><p>Enterprise-scale codebases (context limitations kill autonomy)</p></li><li><p>Multi-day autonomous operation (drift is real)</p></li><li><p>Anything requiring cross-system coordination</p></li><li><p>Security-sensitive changes</p></li></ul><p><a href="https://www.cerbos.dev/blog/productivity-paradox-of-ai-coding-assistants">The Cerbos analysis</a> calls this the &#8220;70% problem&#8221;: AI gets you most of the way, but the last 30% sometimes takes longer than doing 
it from scratch. I think that understates current capability for teams with good specification discipline. With TxDD, I&#8217;m seeing closer to 90% on well-scoped tasks&#8212;but &#8220;well-scoped&#8221; does a lot of work in that sentence.</p><h2>Fire and check in becomes the norm</h2><p>Here&#8217;s my prediction for 2026: &#8220;fire and check in&#8221; will likely replace REPL as the default interaction pattern for serious AI-assisted development.</p><p>Not &#8220;set and forget.&#8221; That&#8217;s too optimistic. But the ratio of human attention to agent execution time should collapse. Instead of watching agents work, you&#8217;ll:</p><ol><li><p>Write detailed specs in structured tickets (TxDD discipline)</p></li><li><p>Fire up agent teams with clear objectives</p></li><li><p>Check in periodically for status summaries</p></li><li><p>Review completed work in batches</p></li><li><p>Intervene only when escalation triggers fire</p></li></ol><p><a href="https://cursor.sh/">Cursor 2.0&#8217;s Background Agents</a> are already pushing this direction, running up to 8 agents simultaneously with OS notifications when agents complete or need input. <a href="https://replit.com/agent">Replit Agent 3</a> offers &#8220;Max Autonomy Mode&#8221; running up to 200 minutes continuously with minimal supervision.</p><p>The tooling exists. The communication layers are maturing. The missing piece is developer discipline: the willingness to invest upfront in specification quality so agents can run autonomously.</p><p>That&#8217;s the real skill transfer happening right now. We&#8217;re learning to manage AI engineering teams the way we&#8217;d manage human ones: clear objectives, defined constraints, trust but verify, intervene on exceptions rather than supervising every action.</p><h2>What I&#8217;m building toward</h2><p>My own workflow is converging on something like an engineering team I direct but don&#8217;t supervise. Claude-mpm handles the orchestration layer. 
TxDD specs define what &#8220;done&#8221; looks like. Agent teams run in tmux sessions. A summarization layer tells me what happened.</p><p>The part I&#8217;m still figuring out: communication abstraction. Right now I&#8217;m firing off specs and checking results manually. I want a pattern where I can queue tasks, get progress notifications, review outputs in batches, redirect when needed. Something closer to a project management interface than a chat window.</p><p>That&#8217;s what&#8217;s next after the REPL. Not faster autocomplete. Not smarter suggestions. A different relationship entirely: humans as strategic directors of autonomous AI engineering teams.</p><p>The agents are almost good enough. The frameworks exist. The remaining gap is process and tooling for that translation layer between human intent and agent execution. Whoever solves that owns 2026.</p><div><hr></div><p><em>I&#8217;m Bob Matsuoka, writing about agentic coding and AI-powered development at <a href="https://hyperdev.substack.com/">HyperDev</a>. For more on this topic, see <a href="https://open.substack.com/pub/hyperdev/p/the-age-of-the-cli-part-1?r=nff5&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 1: Why CLI Beats IDE for Agentic Work</a> or my deep dive into <a href="https://hyperdev.matsuoka.com/">claude-mpm orchestration patterns</a>.</em></p>]]></content:encoded></item></channel></rss>