Hyperdev

What’s In Your Second Brain?

Robert Matsuoka — Mon, 04 May 2026 13:26:26 GMT

The modern CTO toolkit isn’t just apps and coding tools. The real differentiator is a custom knowledge layer — databases, search indices, memory graphs, behavioral instructions that compound over time. No product gives you this. You build it.

Andrej Karpathy gestured at something similar last month when he posted a GitHub Gist he called “LLM Wiki.” His framing: stop using LLMs just to write code, use them to build and maintain a personal knowledge base instead. “Obsidian is the IDE, the LLM is the programmer, the wiki is the codebase.” Three folders, structured Markdown, a large context window, a few Python scripts. No RAG, no vector database. He concluded with: “I think there is room here for an incredible new product.”

He’s right that there’s room. But the product comment is where I’d push back, and I’ll get to that. What Karpathy is describing isn’t a note-taking system. It’s a personal operational knowledge layer. For CTOs specifically, that layer needs to be broader than a personal wiki — it needs live organizational data, agent-connected search, and context that persists across months of decisions. No app hands you that.

TL;DR

Karpathy’s LLM Wiki shows the direction: LLMs as knowledge compilers, not just code generators
A modern CTO’s “second brain” is more than PKM — it’s live databases, custom agents, and contextual search across organizational data
When I joined Duetto as CTO, my custom toolkit let me synthesize a 150-person R&D org in weeks instead of months
The power isn’t Obsidian. It’s what you connect to it — MCP servers, search indices, knowledge graphs
Productizing this is theoretically possible and practically very hard, because the schema is the moat

The toolkit article got it half right

In What’s In My Toolkit: Claude Code and Family, I wrote about vanilla Claude Code’s core limitations: context evaporates, code search is keyword-based, memory doesn’t persist, execution is single-threaded. The tools I built — Claude MPM, mcp-vector-search, kuzu-memory — address each of those gaps.

But that article was about coding workflows. The real story is broader.

The same architecture that makes a coding session more effective — persistent memory, semantic search, specialized agents pulling from structured data — turns out to be extraordinarily useful for executive work. Understanding an organization, tracking decisions over time, querying data across systems, maintaining context across months of meetings and analysis. The toolkit I built for software development became the toolkit I used to onboard as a CTO.

That onboarding story is documented in detail elsewhere. Short version: I pointed a multi-agent framework at GitHub, JIRA, Slack, Confluence, and budget spreadsheets, and synthesized a 150-person R&D organization in the weeks before my start date. The difference between doing that with a chat interface versus a CLI-based orchestration layer with parallel agents and persistent memory wasn’t 2x or 5x. It was closer to 10x.

But the onboarding was just the starting gun. The second brain I assembled keeps compounding.

What’s actually in my second brain

Let me be specific. Because when people (now) hear “second brain” they usually think Obsidian vaults with color-coded tags and pretty Markdown files. That’s part of it. It’s the surface layer.

The actual power comes from what’s underneath.

The memory layer

kuzu-memory is a KuzuDB-backed knowledge graph that persists across every AI session. It stores learnings from conversations, code commits, decisions, patterns. When I start a new Claude Code session on a problem I’ve touched before, the context isn’t blank — it’s enriched with what was learned the last time.

This is the thing people underestimate. A project-specific memory that accumulates over months of work develops a kind of organizational intelligence you can’t replicate in a single conversation. It knows why a particular architectural decision was made. It knows that a vendor was evaluated and found lacking. It knows the terminology your team uses internally that differs from industry standard.

KuzuDB isn’t a product choice for its own sake — it’s graph-native, which means it handles relationships well. The connections between people, systems, decisions, and code are as important as the facts themselves.

The search layer

mcp-vector-search provides semantic search across all project files. Not keyword search — semantic search with AST parsing. When I ask “where is the analysis I did on contractor productivity last quarter,” it finds it even if the document never uses those exact words.

At Duetto, this covers everything in my CTO project: architecture records, meeting notes pulled from Granola, emails I’ve synthesized, analysis documents, planning artifacts. Months of accumulated context, all searchable in seconds. The underlying code intelligence for the engineering organization runs as a separate service — mcp-vector-search is for my working knowledge, not the codebase itself.

The databases

My CTO project has three:

cto.db — SQLite. Work classification, people analysis, contributor data, commit history. The operational database for running analyses and reports.
analytics.duckdb — DuckDB. OLAP queries and analytics. When I need to slice engineering output data in different ways or run something that would be painful in SQLite, it goes here.
duetto_knowledge.db — The RAG-queryable knowledge base backing a Flask web app for interactive exploration.

These aren’t a product I bought. They’re a schema I designed, built incrementally, and own completely. The schema reflects how I think about the organization, which is precisely why it’s useful.

The connectors

gworkspace-mcp handles Drive, Docs, Sheets, Gmail, Calendar, and more. I wrote my own rather than using the off-the-shelf options — Google’s first-party integration and Anthropic’s default both have significant tool coverage gaps. Mine exposes substantially more of the Workspace API surface and integrates transparently with Claude MPM, so agents can use Google Workspace tools without any special configuration at the call site.

Beyond Workspace: Notion API for product specs and planning documents. Extraction scripts for JIRA, Confluence, Slack, Datadog, and AWS. Each system outputs to as raw data, which feeds analysis pipelines that generate reports stored in a project directory.

For company-wide memory, two more tools: duetto-memory and duetto-directory. These handle shared organizational context — information that needs to flow between tools and across team members rather than staying in a single session. Memory persists within our VPC, encrypted to individual users’ OAuth keys. Not even our own IT has access to it. Context shared from Claude Code shows up in Claude.ai, and vice versa, without any manual sync.

The entire flow is queryable. From a single Claude session, I can ask about budget trends, team velocity, specific architectural decisions, or what a particular engineer has been working on for the last three months. Because it’s all in the same context-addressable system.

Obsidian as the front door

Yes, I use Obsidian. But it’s a front door, not the building. The vault holds my personal notes, research captures, and synthesized analysis. The Obsidian Web Clipper feeds raw material into the knowledge pipeline. Templates enforce consistent structure.

Karpathy’s insight about Obsidian as IDE is right in the narrow sense: it’s the interface you use to read and organize. But the interesting work happens outside it — in the databases, the agents, the search indices, the custom scripts.

CLAUDE.md files everywhere

The context layer isn’t just data. It’s also behavioral instructions.

Every major directory in my project has a CLAUDE.md. The root CTO project one is 400 lines of conventions, routing logic, document lifecycle rules, and architectural decisions. Every subdirectory has a more focused version. Every specialized agent has its own constraints.

These files are my second brain’s schema, expressed as instructions rather than data. A single routing rule — “if the prompt mentions meeting notes, save to projects/meetings/2026-W##/“ — sounds trivial. But it means twelve months of meeting notes accumulate in consistent, queryable locations rather than wherever an agent happened to save them. Multiply that by forty routing rules across fifteen subdirectories, and the entire corpus becomes navigable. The CLAUDE.md files are what make the databases useful. Without them, the data is just data.

Karpathy put it well: “You share the schema, not the code.” The schema is the valuable part. The schema is what compounds.

My schema took months to build. It will keep getting better. No product ships with the right schema for my organization, because no product knows what I know about how Duetto’s R&D works.

The productization question

Karpathy said there’s room for an incredible product. He’s not wrong about the gap. He might be wrong about the solution.

The structural problems with productizing a second brain:

Context compounds, products don’t. My system gets smarter with every commit, meeting, and conversation. A SaaS product serves thousands of customers and maintains no one’s specific context. The more I use my system, the wider the gap between it and any off-the-shelf alternative.

The schema is the moat. My knowledge architecture reflects how I think about engineering organizations. Someone else’s knowledge architecture would be different. Products that force their schema on you — and every product does — are imposing someone else’s way of thinking on your problem. That friction is small at first and grows over time.

Privacy is structural, not incidental. My databases contain org structures, salary data, performance patterns, vendor negotiations. Routing that through third-party infrastructure creates risk that’s practically impossible to contain. When I built duetto-memory for enterprise use, the entire stack stays within our VPC, with memories encrypted to individual users’ OAuth keys. Not even IT can read them. That level of isolation is nearly impossible to provide as a multi-tenant SaaS.

Some layers could be productized — the infrastructure, not the intelligence. A well-designed memory MCP with sensible defaults. Semantic search that works without configuration. Privacy-preserving graph storage you don’t have to host yourself. The plumbing.

The schema, the decisions, and the accumulated context can’t be productized. Those are yours. That’s the point — and it’s also why the product gap Karpathy sees will remain open even after someone tries to fill it.

Can anyone do this?

There’s an access problem here, and I’d be dishonest not to acknowledge it.

Building what I’ve described requires knowing Python well enough to write extraction scripts, understanding enough about graph databases to design a schema, and being comfortable with CLI-based tooling and MCP server configuration. Not every CTO has that background. Not every technical leader wants to spend weekends building personal infrastructure.

The irony is that the people who most need better organizational intelligence — executives without deep engineering backgrounds — are least equipped to build these systems. And the people who are most capable of building them are often less interested in the executive problems the systems could solve.

Tiago Forte, who wrote Building a Second Brain, has been making this point for years. His PARA method and CODE framework are accessibility layers — ways to make the underlying ideas approachable without requiring you to build a graph database. The methodology is sound. But it was designed for knowledge workers, not for CTOs running engineering organizations who need live data pipelines, not filing systems. Well-designed for whom?

Karpathy’s LLM Wiki is explicitly a system for someone comfortable writing Python and working with file systems. His Gist has code in it. That’s a feature for his audience and a barrier for everyone else.

What I’d watch for

A few trends that will determine whether this remains a DIY space or gets productized:

MCP as infrastructure. The Model Context Protocol creates a standard interface for exactly this kind of knowledge infrastructure. Memory servers, search servers, database connectors — they all expose the same interface to any compatible AI client. The ecosystem is growing fast. As more MCP servers mature, the configuration burden drops.

Searchable context beats raw window size. Karpathy argues for plain Markdown because ~400K words fit in a modern context window. That’s true, and the window is getting larger. But the more important shift is that structured, searchable context doesn’t have a ceiling. A well-organized knowledge base that spans years of meetings, decisions, and analysis delivers more than any single context window can hold — and the value scales with the quality of the organization, not the size of the model.

Local model quality. Karpathy runs Anthropic agents via Claude Code. But local model quality is improving fast. A system that uses the cloud API for synthesis and queries but runs a local model for routine indexing tasks would be significantly cheaper and more private. Not ready yet. Getting closer.

The product Karpathy thinks exists — if it gets built — probably looks like a well-designed local MCP server with clean configuration, sensible defaults, and a plugin ecosystem for connectors. Not a SaaS. Not a cloud database. Something you install and own.

The people who need it most will have already built their own before any product ships. And in the process of building it, they’ll have accumulated the one thing no product can give them: months of their own operational context, organized the way their own mind works.

That’s not a consolation prize. That’s the whole point.

Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.

Related reading:

What’s In My Toolkit: Claude Code and Family — The coding layer of the stack
I Built a Coding Tool. Then I Used It to Onboard as CTO — Applying agent orchestration to organizational analysis
AI Power Ranking — Tool comparisons and benchmarks for AI practitioners
LinkedIn Newsletter — Strategic AI insights for CTOs and engineering leaders

You weren't imagining things...Claude Code was dumber this month

Robert Matsuoka — Fri, 24 Apr 2026 13:53:25 GMT

So if you’ve been using Claude Code and noticed it felt... off... you weren’t imagining it.

Anthropic published a full breakdown yesterday and it’s actually three separate bugs that compounded into what looked like one big degradation. The developer community was right to be concerned, and the evidence they collected was instrumental in getting this fixed.

Here’s what actually happened:

1. They silently downgraded reasoning effort (March 4)

They switched Claude Code’s default from high to medium reasoning to reduce latency. Users noticed immediately. They reverted it on April 7.

Classic “we know better than users” move that backfired. From their postmortem:

“This was the wrong tradeoff. We reverted this change on April 7 after users told us they’d prefer to default to higher intelligence and opt into lower effort for simple tasks.”

The UI was appearing frozen in high reasoning mode, so they made an executive decision to sacrifice quality for speed. Developers immediately felt the difference and pushed back hard.

2. A caching bug made Claude forget its own reasoning (March 26)

This one was particularly insidious. They tried to optimize memory for idle sessions—clear old thinking after an hour of inactivity to speed up resumption. Sounds reasonable, right?

A bug caused it to wipe Claude’s reasoning history on EVERY turn for the rest of a session, not just once. So Claude kept executing tasks while literally forgetting why it made the decisions it did.

The cascading effects were brutal:

Every request became a cache miss
Usage limits drained faster than expected
Claude appeared “forgetful and repetitive”
Sessions felt like they were constantly resetting

3. A system prompt change capped responses at 25 words between tool calls (April 16)

They added this seemingly innocent instruction: “keep text between tool calls to 25 words. Keep final responses to 100 words.”

It caused a measurable 3% drop in coding quality across both Opus 4.6 and 4.7. They caught this through ablation testing—removing the instruction and measuring the performance difference.

Reverted April 20.

The community evidence was damning

While Anthropic was investigating internally, the developer community was building their own case. Stella Laurenzo from AMD’s AI group published the most comprehensive analysis—6,852 Claude Code sessions and over 234,000 tool calls.

Her findings:

Median visible thinking length collapsed 73% (2,200 → 600 characters)
API calls per task spiked up to 80x from February to March
Claude was choosing “simplest fix” over correct solutions

BridgeMind’s testing showed Opus 4.6 accuracy dropping from 83.3% to 68.3%.

The data was undeniable.

The perfect storm effect

Here’s what made this particularly hard to pin down: all three bugs affected different traffic slices on different schedules. The combined effect looked like random, inconsistent degradation.

Hard to reproduce internally. Hard for users to isolate the exact cause. It just felt... wrong.

Some sessions hit the reasoning downgrade. Others hit the caching bug. The unlucky ones hit multiple issues simultaneously. No wonder it seemed like Claude was having random bad days.

What this reveals about AI product development

This postmortem is actually refreshing in its transparency. Most AI companies would have quietly fixed the issues and moved on. Anthropic owned the mistakes publicly.

But it also highlights a fundamental tension in AI product development: users often prefer maximum capability over convenience optimizations. The reasoning effort downgrade was done for user experience (reduce perceived latency), but developers would rather wait for better output.

The lesson: don’t optimize away what users value most without asking them first.

All fixed now (v2.1.116)

As of April 20, all three issues are resolved:

Default reasoning is now “xhigh” for Opus 4.7, “high” for others
Caching bug squashed
Verbosity limits removed
Usage limits reset for all subscribers

Anthropic is also committing to more transparency going forward with a dedicated @ClaudeDevs account for deeper technical communication with developers.

The community was right to raise hell about this. And Anthropic’s response—full transparency with concrete fixes—sets a good precedent for how AI companies should handle quality regressions.

Your coding assistant is back to full strength.

Independent Validation

The technical analysis backing this story comes from multiple independent sources. Stella Laurenzo’s comprehensive audit of 6,852 sessions provided the quantitative foundation. BridgeMind’s testing offered controlled benchmark data. These weren’t isolated complaints—they were systematic investigations with reproducible findings.

When a company publishes a detailed postmortem acknowledging specific engineering decisions that degraded their product, and that postmortem aligns with community-gathered evidence, we’re seeing transparency in action. The developer community did the work to document the problems. Anthropic owned the solutions.

Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.

Related reading:

AI Power Ranking — Tool comparisons and benchmarks for AI practitioners
LinkedIn Newsletter — Strategic AI insights for CTOs and engineering leaders

Opus 4.6 vs 4.7: The Real Cost of Incremental AI Improvements

Robert Matsuoka — Wed, 22 Apr 2026 11:31:49 GMT

Opus 4.7 dropped last week. Lots of excitement. Then the other shoe dropped. I ran identical coding tasks against both Opus 4.6 and 4.7 to see if the capability improvements justify the cost increase. Both models passed all 10 tests. The quality difference is real — but Opus 4.7 consumed 3.6× more tokens and cost 3.6× more for the same outcome.

That’s not a typo. Same task, same success rate, nearly 4× the cost. I have the receipts.

Anthropic says “pricing unchanged” because the per-token rates stayed the same. What they don’t mention is that Opus 4.7 systematically burns through more tokens to complete identical work. The model writes, then revises. Opus 4.6 writes correctly the first time. Both approaches work. Only one bills you for the revision process.

TL;DR

Both models passed 10/10 tests: Quality improvements are measurable but incremental (better typing, more thorough code)
3.6× cost increase for identical outcomes: $0.38 vs $1.38 for a 30-minute coding task in controlled testing
Token consumption drives cost, not capability: Opus 4.7’s iterative working style consumes 2.9× more output tokens per task
Agentic mode required: One-shot testing shows 4.7 fails 9/10 tests without tool access, while 4.6 passes perfectly
Per-token rates unchanged, real bill moved anyway: 4.7 burns 4.8× more cache tokens per task — the rate card stays flat, your invoice doesn’t

The Controlled Test

Last week I ran a head-to-head benchmark using the same Level 1 coding task for both models: build a complete Python CLI tool (Markdown Table Formatter) from scratch with full test coverage.

Test setup:

Framework: claude_agent_sdk v0.1.64, full agentic mode
Models: claude-opus-4-6 vs claude-opus-4-7
Success criteria: Pass all 10 provided pytest tests
Execution: Concurrent runs with identical prompts

Both models succeeded. The difference was entirely in how they got there.

The Numbers

Metric Opus 4.6 Opus 4.7 Ratio Wall clock time 114.8s 259.1s 2.3× slower Agent turns 17 23 35% more Output tokens 6,384 18,289 2.9× more Cache read tokens 215,853 1,034,165 4.8× more Total cost $0.38 $1.38 3.6× more expensive

The Behavioral Fingerprint

The tool usage patterns reveal why 4.7 costs more:

Tool Opus 4.6 Opus 4.7 Write 6 7 Bash 6 9 Read 4 1 Edit 0 5

Opus 4.7 made 5 Edit calls to revise files after writing them. Opus 4.6 made zero — it wrote all 6 source files correctly in a single pass, ran pytest once, passed 10/10 tests.

The cache token burn (4.8× more) suggests 4.7 does extended internal reasoning between each tool call. It’s thinking harder, which shows up in better code quality — more type hints (35 vs 25 function definitions), more thorough coverage (820 vs 471 lines of code). But you pay for that thinking process.

Quality vs Cost Trade-off

The output quality difference is genuine. Opus 4.7’s code was more defensively written — better typed, more thorough on edge cases. When you’re in the middle of debugging a genuinely hairy distributed system problem or making an architectural call with real downstream implications, that extra care is worth something.

But for this Level 1 coding challenge, both approaches delivered identical functionality. The question becomes: is 40% better typing and 74% more comprehensive coverage worth 260% higher costs?

When the Premium Isn’t Optional

I tested both models in one-shot mode (no tools, single response) to see if you could avoid the iterative cost overhead.

Metric Opus 4.6 Opus 4.7 Output tokens 4,725 23,907 Cost $0.27 $0.94 Tests passed 10/10 1/10

Opus 4.7 failed catastrophically without tool access. It generated 5× more tokens but couldn’t follow output format instructions — most files were unparseable and 9 of 10 tests failed. Opus 4.6 passed perfectly on the first attempt.

This reveals a structural dependency: Opus 4.7’s quality advantages require the full agentic feedback loop. You can’t switch to a cheaper execution mode to control costs. The iterative self-correction that makes it better is also what makes it expensive — there’s no cheaper version of how this model works.

Scale It Out

Scale these numbers to something realistic:

100 equivalent coding tasks per day:

Opus 4.6: ~$38/day → ~$13,870/year
Opus 4.7: ~$138/day → ~$50,370/year
Annual cost increase: +$36,500

This matches what production teams are reporting. The Finout analysis documented overnight cost jumps from $500 to $675/day after deploying 4.7. My testing provides a mechanistic explanation: the model’s working style is token-intensive by design.

The cost increase compounds with Anthropic’s separate tokenizer changes that increase consumption up to 35% for identical prompts. You get hit twice: more tokens per task, plus each token costs more to count.

This Is Anthropic’s Move, Not the Industry’s

Other providers aren’t doing this. GPT-5.4 achieves comparable benchmark performance without the tokenizer change. Anthropic can pull this off because they’re ahead on benchmarks right now — that’s the advantage, and they’re using it.

Which means this is actually a model selection problem, not a budget problem.

I’m not upgrading my agentic workflows to 4.7 by default. For complex architectural work where the reasoning depth matters — distributed systems debugging, refactoring decisions with downstream implications — yes, 4.7 earns the premium. For routine code generation, test writing, documentation? 4.6 passes the same tests at a quarter of the cost, as I just demonstrated.

Sonnet is even more aggressive on cost for work that doesn’t need Opus-level reasoning at all. I’ve been pushing more of my day-to-day agentic tasks there.

GPT-5.4 is worth keeping in rotation too. Comparable coding benchmark performance, no tokenizer games, and the competitive pressure helps if you ever need to push back on Anthropic pricing.

The Reddit community caught the tokenizer changes within hours of release while Anthropic’s communications stayed focused on “unchanged pricing.” That’s the early warning system. Watch community cost reports when a new model drops, not the vendor announcement.

Anthropic will keep doing this as long as they’re leading. The way you stay ahead of it is knowing your actual token consumption per task — not the rate card, the real burn — and routing work to the cheapest model that gets the job done.

Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.

Related reading:

AI Power Ranking — Tool comparisons and benchmarks for AI practitioners
LinkedIn Newsletter — Strategic AI insights for CTOs and engineering leaders

It’s The Harness, Stupid!

Robert Matsuoka — Mon, 13 Apr 2026 17:17:16 GMT

It’s The Harness, Stupid!

Why AI tool orchestration now matters more than foundation model quality

Author: Bob Matsuoka, CTO @ Duetto Research
April 6, 2026

TL;DR

Same-model testing reveals 0.82-point quality spread (3.93 to 4.75) and 7x efficiency differences—orchestration dominates outcomes
Market validation: Claude maintains 70% developer preference despite GPT-5.4 achieving model parity through superior harness quality
Reddit analysis confirms Codex efficiency gains come from orchestration improvements, not just model upgrades
Competitive advantage has shifted permanently from model superiority to ecosystem superiority

Bottom line: The harness era has begun. Choose tools based on workflow fit, not benchmark claims.

The $50B Model Myth

The AI industry has a fixation problem. Every week brings breathless announcements about parameter counts, training costs, and benchmark scores. “GPT-6 has 50 trillion parameters!” “Our model scored 94.7% on SWE-bench!” “We spent $2 billion on compute!”

Three converging pieces of evidence prove this approach is fundamentally wrong.

Evidence #1: I tested eight AI coding agents across five programming challenges. Four agents used identical Claude Sonnet 4.6 models. Quality scores ranged from 3.93 to 4.75—a 0.82-point spread on the same foundation model.

Evidence #2: GPT-5.4 achieved parity with Claude Sonnet 4.6 on coding benchmarks. Yet Claude maintains 70% developer preference through superior ecosystem quality.

Evidence #3: Reddit developer communities confirm Codex’s efficiency improvements come from orchestration architecture changes, not just model upgrades.

The harness matters more than the model. Choosing an AI coding tool is now primarily an engineering decision, not a model selection decision. The next competitive advantage isn’t bigger models—it’s better orchestration.

Evidence Pillar #1: The Smoking Gun Laboratory Data

The Bake-Off Setup

I designed five programming challenges ranging from 30-minute tasks to 8-hour full-stack builds:

Level 1-2: Simple scripts and basic applications
Level 3: API integration with Docker containerization
Level 4: Extensible data processing pipeline (architecture test)
Level 5: Full-stack web application with authentication

Eight agents competed: Claude Code, Claude MPM, Codex, Gemini CLI, Auggie, Qwen+Aider, DeepSeek+Aider, and Warp AI. Each received identical prompts. A panel of expert developers blind-reviewed all submissions across eight criteria: functionality, correctness, best practices, architecture, code reuse, testing, error handling, and documentation.

The Harness Advantage Data

Table 1: Same Model, Different Worlds

Four agents using identical Claude Sonnet 4.6 models. Quality scores from 3.93 to 4.75—a 0.82-point spread. claude-mpm finished in 45 minutes while warp took 313 minutes. Almost 7x longer for lower quality results.

The Scaling Pattern

The harness advantage compounds with complexity:

Levels 1-2: All agents performed similarly. Simple tasks don’t reveal orchestration differences.
Level 3: API integration and Docker setup separated agents that plan from those that code-and-fix. Clear gaps emerged.
Levels 4-5: Architecture and full-stack challenges broke most agents. Only well-orchestrated systems completed the complex workflows.

The pattern is clear: as complexity increases, harness quality becomes the primary determinant of success.

Evidence Pillar #2: Market Validation — GPT-5.4 Caught Up

Model Parity Achievement

February-April 2026 benchmarks confirm GPT-5.4 has achieved parity with Claude Sonnet 4.6:

Core Benchmarks:

SWE-bench Verified: GPT-5.4 ~80% vs Claude 79.6% (statistical tie)
SWE-bench Pro: GPT-5.4 57.7% vs Claude 43.6% (GPT leads complex problems)
Terminal-Bench: GPT-5.4 75.1% vs Claude ~65% (DevOps advantage)
Context handling: Both models feature 1M token windows

Yet Claude Still Dominates Through Harness Advantages

Despite achieving model parity, the competitive landscape tells the harness story:

Market Reality:

Developer preference: Claude 70% (superior workflow integration)
Enterprise share: Anthropic +4.9% MoM growth, OpenAI -1.5% decline
Revenue: Claude Code $2B ARR in 6 months

Even when models reach parity, harness quality determines adoption.

The Multi-Model Strategic Reality

Leading organizations aren’t choosing between models anymore—they’re deploying three-tier strategic architectures based on cost-performance optimization:

Tier 1: Daily Workhorse (60-70% of requests)

Claude Sonnet 4.6: $3/$15 per million tokens
High-volume development, routine coding tasks
95%+ of premium model quality at half the cost
Default choice for most enterprise development work

Tier 2: Specialized Operations (20-30% of requests)

GPT-5.4: $2.50/$15 per million tokens
Terminal operations, DevOps workflows, CI/CD debugging
75.1% Terminal-Bench score (10-point lead over competitors)
Inherited Codex’s terminal operation dominance

Tier 3: Premium Analysis (10-20% of requests)

Claude Opus 4.6: $5/$25 per million tokens
Complex reasoning, architectural decisions, high-stakes analysis
World leader in abstract reasoning (87.4% vs GPT-5.4’s 83.9%)
When cost justifies maximum capability

This confirms the core thesis: when models are “good enough,” teams optimize for strategic cost-performance fit, not raw capability or marketing claims.

Evidence Pillar #3: Community Validation — The Codex Orchestration Story

Reddit Confirms Orchestration Improvements

Reddit research explains Codex’s impressive efficiency results (42 minutes, 4.49 quality score). The evidence confirms improvements come from orchestration, not just model upgrades.

Architectural Evolution Evidence:

Workflow Efficiency Improvements:

Developers report queuing “4-5 Codex tasks before diving into manual work”
“2-3 completed PRs waiting for review” after a coffee break
P99 response time 45ms vs Copilot’s 55ms through better context management
Parallel processing capabilities that enable true background orchestration

Enterprise Orchestration Benefits:

The Community Strategic Deployment Pattern

Reddit developers now recommend different tools for different purposes:

Claude Code: Code quality and reasoning
Cursor: Daily coding integration
OpenAI Codex: Complex multi-agent workflows and long-horizon autonomy

This matches exactly what the market data predicted: teams use orchestrated tools strategically rather than seeking one universal solution.

The Harness Quality Ladder

Based on all three evidence pillars, I see four tiers of orchestration quality emerging:

Tier 1: Basic Wrappers

Simple API access, minimal context management
Examples: Raw ChatGPT interface, basic API wrappers
Limitation: No file coordination, poor context retention

Tier 2: Workflow Tools

File awareness, some context management
Examples: GitHub Copilot, basic IDE extensions
Capability: Single-file optimization, limited cross-file understanding

Tier 3: Orchestrated Systems

Multi-file coordination, workflow integration
Examples: Cursor, Claude Code, well-configured aider
Advantage: Understands project structure, handles complex tasks

Tier 4: Agentic Frameworks

Multi-agent coordination, planning, verification
Examples: claude-mpm, advanced orchestration systems
Power: Full project lifecycle, quality assurance, architectural thinking

The performance cliff between tiers is exponential, not linear. Bad orchestration can make great models perform poorly; great orchestration can make good models perform excellently.

Academic and Industry Validation

This isn’t just empirical observation. Multiple 2026 research papers and industry studies support the harness thesis:

Academic Consensus:
The arXiv paper “Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems” shows that domain-tuned models with better orchestration achieve superior cost-normalized accuracy despite using smaller base models.

SWE-bench data reveals the same pattern. Cursor, Claude Code, and Auggie all use similar base models yet score between 50.2% and 55.4%, while the raw model score is only 45.9%. The 5.9-point improvement comes entirely from better context retrieval and agent design.

Business Reality Check:
Enterprise adoption surveys show a clear shift in CTO priorities. “Model performance” is dropping in tool evaluation criteria, replaced by governance, integration quality, and workflow fit. As one 2026 McKinsey report put it: “CTOs are realizing their biggest bottleneck isn’t model performance—it’s governance.”

What This Means for Engineering Leaders

Stop Optimizing for Benchmarks

The old procurement mindset was model-first: “We need access to GPT-6 for competitive advantage.” The new reality is that benchmark performance doesn’t predict practical utility. SWE-bench scores don’t tell you whether a tool will integrate with your existing workflow, handle your codebase size, or recover gracefully from errors.

Start evaluating harness quality:

Context management: How well does it understand your project structure?
File coordination: Can it work intelligently across multiple files?
Error recovery: Does it handle failures gracefully or require constant babysitting?
Workflow integration: How does it fit with your team’s existing development process?

Budget for Orchestration Quality

The three evidence pillars show that investing in better orchestration yields measurable returns:

Quality per minute: claude-mpm achieved 4.75 quality in 45 minutes; warp achieved 3.94 in 313 minutes
Market validation: Claude maintains dominance despite model parity through superior developer experience
Enterprise results: 70% more PRs, 50% faster code review, 67% faster turnaround

The ROI case for harness investment is clear and quantifiable.

Team Productivity Focus

Tool choice impacts your entire development pipeline. The 7x speed difference between well and poorly orchestrated tools using the same model means tool selection is a productivity multiplier, not just a capability decision.

Better tools also reduce onboarding time and increase adoption rates. A tool that works reliably gets used; one that requires constant troubleshooting gets abandoned.

The Competitive Landscape Evolution

Codex Deserves Recognition

Codex’s performance has significantly improved. At 42 minutes for all five levels with a 4.49 quality score, it achieved by far the best efficiency in my study. GPT-5.4+ combined with the orchestration improvements OpenAI made represents a compelling package. The Reddit research confirms this wasn’t just a model upgrade—it was an architectural evolution toward multi-agent orchestration.

Claude Code’s Harness Moat

While Claude Code performed well (4.53 quality score), the market validation shows its true strength: ecosystem superiority. Despite GPT-5.4 achieving model parity, Claude maintains 70% developer preference through superior harness quality. This is exactly what sustainable competitive advantage looks like in the post-parity era.

The Multi-Model Future

All evidence points to the same conclusion: the era of picking one model is over. Leading organizations deploy three-tier cost-performance architectures, optimizing for specific strengths rather than seeking universal solutions.

Real enterprise case studies validate this pattern:

The successful pattern: Sonnet for volume, GPT-5.4 for DevOps, Opus for complexity.

The Token Economics Reality

claude-mpm achieved the highest quality score (4.75) but used 87 million tokens versus codex’s 120K. This looks expensive until you consider the output: 262 comprehensive tests (vs codex’s 32), complete documentation, 100% verification rates, and multi-file coordination (note: this was also a wake-up call to me to focus on token optimization, current version is much stingier)

The 700x token multiplier isn’t overhead—it’s the cost of work a solo agent skips. Orchestration doesn’t waste tokens—it spends them on comprehensive deliverables.

The optimization question: Could you achieve 80% of the quality benefits at 30% of the token cost? The opportunity isn’t eliminating orchestration—it’s finding the minimal viable team size for maximum impact.

The Vendor Bias Problem: “Opus for Everything”

Boris Cherny, the Claude Code lead, recently advocated for using “Opus for everything.” This perfectly illustrates the disconnect between vendor recommendations and practical deployment reality.

Only someone working for Anthropic can say that.

When your employer provides unlimited access to premium models, of course you’d recommend the most expensive option for every task. But real organizations operating with P&L responsibility make strategic decisions about when premium capability justifies premium cost.

This vendor bias actually validates the multi-model thesis:

Vendors say: “Use our premium model for everything”
Users do: Strategic model selection based on task complexity and budget constraints
Market reality: 70% prefer Claude for daily coding (cost/speed), GPT-5.4 for complex reasoning (quality ceiling)

Cherny’s comment inadvertently proves that cost-conscious orchestration is the real competitive battleground. Companies that figure out optimal model routing—not maximal model usage—will have sustainable advantages.

The vendors push premium. The market chooses strategically. The harness makes both possible.

The Future: Welcome to the Harness Era

What Changes for Developers

Tool selection framework:

Workflow fit: Does it match how your team works?
Integration quality: Plays well with existing tools?
Reliability: Can you trust it with production code?
Model quality: Fourth priority

What Changes for the Industry

Foundation models are becoming commodities. Differentiation shifts to integration, context management, and user experience. The next unicorns will be harness companies, not model companies.

Major funding flows to orchestration companies. Enterprise procurement evaluates integration first, model second.

The Competitive Moat Shift

The old game was: train bigger models, claim benchmark superiority. The new game is: build better orchestration, solve real workflow problems. Model access becomes a utility; workflow mastery becomes the moat.

Practical Recommendations

For CTOs and Engineering Leaders

Audit orchestration quality: Test tools with your actual codebase for 2-week trials
Budget 60/40: Spend more on harness development than model subscription fees
Measure real metrics: Track pull request velocity and code review time, not benchmark scores
Evaluate integration first: How well does it fit your existing CI/CD pipeline?

For Developers

Test with real projects: Spend 2 days with each tool on actual work before deciding
Learn orchestration patterns: Context management and file coordination matter more than prompts
Invest in mastery: The 7x efficiency difference justifies significant learning time
Ignore marketing claims: Model access means nothing without good orchestration

For the AI Industry

Build for workflow integration: Solve real development pipeline problems
Measure practical utility: Developer retention and task completion rates beat benchmarks
Focus on context management: Multi-file coordination is the real competitive moat

Conclusion: The Questions That Matter Now

The old question was: “What’s the best model?”

The new question is: “What’s the best harness for my team’s workflow?”

Three evidence sources prove we’ve crossed a threshold: foundation models are “good enough,” and orchestration quality now dominates outcomes. Laboratory testing, market validation, and community confirmation point to the same reality.

The foundation model is the engine. The harness is the car. The best engine in the world won’t get you anywhere without wheels.

The harness era has begun. Drive accordingly.

Bob Matsuoka is CTO at Duetto Research and creator of Claude MPM, one of the agents evaluated in this study. All evaluation data and methodology are available at github.com/bobmatnyc/ai-coding-bake-off for reproducibility.

Appendix: Complete Results Data

Quality Scores by Criterion

GPT-5.4 vs Claude Sonnet 4.6 Market Data

SWE-bench Performance:

SWE-bench Verified: GPT-5.4 ~80% vs Claude 79.6% (statistical tie)
SWE-bench Pro: GPT-5.4 57.7% vs Claude 43.6% (GPT advantage on complex problems)
Terminal-Bench: GPT-5.4 75.1% vs Claude ~65% (GPT DevOps advantage)

Market Metrics:

Developer preference (daily coding): Claude 70%
Enterprise market share: Anthropic +4.9% MoM, OpenAI -1.5% MoM
Claude Code revenue: $2B ARR in 6 months

Methodology Notes

Laboratory data: Single run evaluation with disclosed author bias
Market data: Cross-validated across 15+ authoritative sources
Community research: Reddit analysis across 8+ developer subreddits
Statistical confidence: Mean inter-reviewer deviation of 0.216 points
Reproducible: All data and prompts available in public repository

I Met a Movie Star Mila Jovovich — As a Coder

Robert Matsuoka — Sat, 11 Apr 2026 12:31:14 GMT

I didn’t expect to meet Mila Jovovich through a GitHub issue.

But there I was last week, deep-diving into her AI memory framework called MemPalace, when I discovered something remarkable: the “Resident Evil” and “Fifth Element” star had created one of the most talked-about AI memory systems of 2026. And she’d done it using Claude Code, the same AI-assisted development environment I use daily.

More remarkably, when I found critical bugs in her benchmark methodology, she responded directly through her Claude Code workflow, acknowledging the issues and implementing fixes. Not through a PR team or engineering intermediaries — Mila herself, using AI-assisted development to debug complex memory retrieval algorithms at 9 AM on a Thursday.

This isn’t a story about a celebrity coding stunt. It’s about something much more profound: we’ve entered an era where outcomes and features drive development, not the technical limitations of writing code.

The MemPalace Phenomenon

In April 2026, Mila Jovovich and developer Ben Sigman released MemPalace, an open-source AI memory system that immediately went viral. Within 48 hours, it had over 23,000 GitHub stars. The system claimed to achieve the first perfect score on the LongMemEval benchmark, scoring 96.6% raw recall.

The project represents something unprecedented: a free, locally-running memory system that rivals expensive cloud alternatives like Mem0 ($19-249/month) and Zep ($25+/month). It uses the “memory palace” technique — a classical memory method dating back to ancient Greece — implemented through ChromaDB and SQLite, with zero ongoing API costs.

The technical architecture includes basic Claude Code integration (save hooks every 15 messages and before context compression) and 24 tools via the Model Context Protocol (MCP), making it compatible across multiple AI platforms.

The duo had spent months building it using Claude Code’s AI-assisted development environment. As Sigman noted, he provided “the engineering chops” while Jovovich drove the architectural vision.

When Audits Meet AI-Generated Code

That’s when things got interesting.

As someone who works extensively with AI memory systems — I maintain KuzuMemory, a graph-based memory framework — I was naturally curious about MemPalace’s benchmark methodology. The claimed 96.6% recall rate was extraordinary, especially for a system running entirely locally.

So I dove in.

What I found were several methodological issues that fundamentally undermined the headline numbers. The benchmark adapter was discarding assistant turns in conversation history, causing systematic under-recall on certain question types. More critically, the benchmark wasn’t actually testing MemPalace’s core functionality — it was primarily testing ChromaDB’s raw vector search capabilities.

I filed Issue #242 documenting the assistant turn bug, and Issue #214 showing that the 96.6% score was essentially a ChromaDB score, not a MemPalace score.

Mila’s response was immediate and technically sophisticated:

“Hey @bobmatnyc — I’ve taken a look and ran it through CLI. This is a real bug and it’s urgent. You caught that benchmarks/longmemeval_bench.py at lines 189-190 builds each session’s indexed document by concatenating only user role turns... Fix priority: this must land before any public benchmark re-run.“

She didn’t deflect or dismiss. She debugged the issue herself, identified the exact lines of code causing the problem, explained the downstream impact on other benchmarks, and outlined a detailed fix plan including regression tests.

This wasn’t PR speak. This was an AI-assisted developer engaging seriously with technical criticism.

The Democratization Shift

This interaction crystallized something profound about our current moment in software development.

We’re witnessing the emergence of a new class of builders: technically-minded individuals who understand software conceptually but may not have traditional coding backgrounds. AI-assisted development tools like Claude Code, GitHub Copilot, and Cursor have lowered the implementation barrier to the point where vision and domain expertise matter more than syntax mastery.

Mila Jovovich exemplifies this shift perfectly. Without formal technical education (she left school in 7th grade for modeling), she spent months intensively learning AI-assisted development through Claude Code starting in late 2025. She understood the conceptual framework of memory palaces deeply enough to architect a sophisticated system. Her collaboration with Ben Sigman — CEO of Bitcoin lending platform Libre Labs, who provided the engineering expertise while she drove architectural vision — represents a new model of software development where domain knowledge and AI tool fluency can substitute for traditional programming backgrounds.

The fact that a movie star can release a technically competent, widely-adopted memory framework isn’t a commentary on coding getting easier (though it has). It’s about software development becoming more accessible to domain experts and visionaries who previously couldn’t bridge the implementation gap.

What MemPalace Gets Right

Despite the benchmark issues I uncovered, MemPalace demonstrates genuine technical sophistication. The memory palace metaphor isn’t just marketing — it’s a thoughtful architectural choice that makes AI memory systems more intuitive and debuggable.

The system includes elegant features like per-agent memory “wings” that prevent cross-contamination between different AI assistants. The Claude Code integration hooks are well-designed, automatically triggering memory saves at logical conversation boundaries. The MCP implementation is clean and follows established patterns.

Most importantly, the project tackles a real problem: most AI memory systems are either expensive cloud services or complex local installations. MemPalace provides a middle path that’s both free and relatively easy to deploy.

Through my testing and integration experiments, I learned techniques that improved my own KuzuMemory system. The competitive analysis forced me to think more carefully about memory organization patterns and retrieval strategies. This kind of cross-pollination benefits the entire ecosystem.

The Validation Requirement

But the benchmark controversy highlights a crucial point: democratized software development still requires traditional validation methods.

AI-assisted coding tools excel at implementation but can perpetuate subtle conceptual errors throughout a codebase. The MemPalace benchmark issues weren’t obvious bugs — they were methodological problems that required domain expertise to identify.

This creates an interesting dynamic: AI tools enable rapid development by non-traditional developers, but peer review by experienced practitioners becomes even more critical. The community response to MemPalace’s inflated benchmarks wasn’t hostile — it was collaborative debugging at scale.

Mila’s willingness to engage directly with technical criticism and implement fixes demonstrates the right approach. The democratization of software development doesn’t eliminate the need for technical rigor; it distributes that rigor across a broader community.

The Harness Thesis Validated

This story perfectly validates what I call the “harness thesis” — that we’ve entered an era where AI tool ecosystems matter more than underlying model capabilities.

MemPalace succeeded not because Mila wrote perfect code from scratch, but because she effectively orchestrated Claude Code to implement her vision. The system’s value comes from its architectural choices, integration quality, and user experience — not from novel algorithmic breakthroughs.

Similarly, my ability to audit and improve the system came not from superior coding skills, but from having developed complementary expertise with memory systems and benchmark methodology. The collaboration that emerged — distributed across GitHub issues, with contributors from multiple backgrounds — represents the new model of software development.

We’re not just building different software; we’re building software differently.

Meeting Mila Through Code

In the end, I did meet Mila Jovovich — through our AI Agents, lines of Python code, GitHub issues, and technical discussions about memory retrieval algorithms, mediated by our respective Claude Code workflows. Not the meeting I would have predicted, but somehow more meaningful than a typical celebrity encounter.

She embodies a new archetype: the technical visionary who uses AI tools to implement sophisticated ideas without traditional programming backgrounds. Her willingness to engage with criticism and continuously improve the system demonstrates the collaborative spirit that makes this new era of development possible.

The future of software isn’t just about better AI models or more powerful tools. It’s about enabling more people with domain expertise and creative vision to participate in building the systems that shape our digital world.

And sometimes, that means meeting your childhood movie star idol in a GitHub issue thread, debugging memory palace algorithms together.

Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.

Related reading:

It’s The Harness Stupid — Why AI tool ecosystems matter more than model capabilities
AI Power Ranking — Tool comparisons and benchmarks for AI practitioners
LinkedIn Newsletter — Strategic AI insights for CTOs and engineering leaders

Referenced Links:

The Software Factory is the Next Big Challenge

Robert Matsuoka — Wed, 08 Apr 2026 12:30:16 GMT

The Software Factory is the future of software development

Stripe engineers send Slack messages that automatically become production code. Not suggestions. Not drafts. Production code merged into their main branch, supporting over a trillion dollars in annual payment processing.

Their “Minions” system generates 1,300 pull requests per week with zero human-written code. Fire-and-forget automation from conversation to deployment. While the rest of us debate whether AI can write good code, Stripe has built a software factory that produces enterprise-grade applications at scale.

The software factory isn’t a future concept. It’s a present reality, and it represents the next fundamental challenge for engineering organizations.

What We’re Building at Duetto

I’ve been thinking about this a lot lately. At Duetto, we’re exploring what a software factory could look like for hospitality technology. Not because we want to eliminate developers, but because we’re hitting the limits of traditional development approaches for domain-specific applications.

Our challenge isn’t just writing code—it’s translating complex hotel revenue management requirements into software that works reliably across thousands of properties with different systems, data formats, and business rules. The cognitive load of keeping all these variations in mind while building features is becoming unsustainable.

What if we could describe what we need in something like our APEX specifications, and have the system generate not just code, but complete deployments? Kubernetes instances running Claude Code agents, database migrations, monitoring setup, the whole stack configured for that specific use case.

The goal isn’t replacing our engineering team. Our developers should be solving revenue optimization algorithms and building domain-specific integrations, not configuring YAML files for the hundredth deployment variation.

The Stripe Blueprint

Stripe’s Minions reveal what a production software factory actually looks like when you strip away the hype and focus on what works.

Five-Layer Pipeline: Their system transforms Slack messages into production-ready pull requests through a structured pipeline. Not magic—engineering discipline applied to automation.

Sandboxed Execution: Every agent runs in isolated containers with codebase checkouts. They can’t access production systems, can’t cause cascading failures, can’t break things outside their designated scope. The walls matter more than the model.

Surgical Tool Selection: Their Model Context Protocol provides access to hundreds of internal tools, but agents get intelligently prefetched access to only the ~15 tools relevant to their specific task. Not everything available—the right things available.

One-Shot Optimization: Instead of conversational back-and-forth, their agents are optimized for well-defined work that completes in a single execution. Better latency, lower costs, more predictable outcomes.

The results speak for themselves: 1,300 PRs weekly, zero human-written code in merged changes, supporting their entire payment infrastructure. This isn’t a pilot program. This is their production development workflow.

The Broader Software Factory Landscape

Stripe isn’t alone in building these systems, just the most public about their approach.

Netflix has their federated developer console integrating dozens of tools into a single unified experience. Spotify’s Backstage holds 89% market share among internal developer platforms, reducing time-to-tenth-pull-request by 55% for new developers.

The open source ecosystem is catching up quickly. OpenHands provides a model-agnostic platform for cloud coding agents with $18.8M in Series A funding. CodeT5 handles multi-language code generation. GitHub Copilot Enterprise is expanding beyond code completion into full workflow automation.

Major cloud providers are also building comprehensive platforms. Microsoft’s GitHub Copilot Workspace, Google’s Duet AI for developers, and Amazon’s Q Developer all represent enterprise-grade attempts at software factory capabilities.

According to Gartner, 80% of large engineering organizations now have dedicated platform teams. The question isn’t whether software factories are coming—it’s whether your organization will build one or buy one.

What a Proper Software Factory Requires

Building a software factory isn’t just about connecting AI tools to deployment pipelines. Based on what’s working at Stripe and emerging patterns across the industry, here are the essential components:

Artifact Response Systems

Your factory needs to respond to structured specifications and generate complete deployments. At Duetto, this might mean taking an APEX specification for a new revenue optimization feature and producing:

Kubernetes deployment configurations
Database migration scripts
Monitoring and alerting setup
Load testing scenarios
Documentation

The system should handle the entire deployment lifecycle from specification to running production service, not just generate code that someone has to manually deploy.

Strategic Human Review Checkpoints

Notice I said strategic, not comprehensive. Stripe’s fire-and-forget model works because they’ve identified the specific points where human judgment adds value without blocking automation.

For enterprise applications, you need checkpoints at:

Specification validation: Do the requirements make business sense?
Security review: Are access patterns and data handling appropriate?
Integration testing: Does this work with existing systems?
Production readiness: Are monitoring and rollback capabilities sufficient?

The key is making these gates fast and decisive, not bureaucratic approval processes that defeat the purpose of automation.

Scaffolding for Error Detection

Your factory will produce broken code. That’s not a bug—that’s reality. The difference between a prototype and a production system is sophisticated error detection and recovery.

This means:

Isolated execution environments where failures can’t cause broader damage
Automated testing and iteration when initial attempts fail
Multi-layer validation before anything reaches production
Comprehensive rollback capabilities for when something gets through anyway

Stripe’s sandbox architecture is brilliant because it lets agents fail safely while learning from those failures to improve future attempts.

Success Criteria Parameters

Your factory needs to know what success looks like for each type of work. Not just “the code compiles,” but measurable business outcomes.

For a hospitality feature, success might mean:

Performance benchmarks met under load
Integration tests pass with five different PMS systems
Revenue impact measurable within 30 days
Zero customer-facing errors in the first week

Define these criteria upfront, build them into your validation pipeline, and let the factory optimize for actual business value rather than technical metrics alone.

Cost Tracking and Optimization

AI-powered development isn’t free. You need visibility into the computational costs, tool usage, and human review time for each generated system.

Stripe optimizes for this explicitly—their one-shot agents cost less than conversational approaches, their surgical tool selection reduces context costs, their automated testing prevents expensive human debugging cycles.

Track these metrics from day one. The difference between a cost-effective software factory and an expensive experiment is usually found in the operational details.

Deployment Models

Your factory needs sophisticated understanding of how to deploy different types of applications. Golden Path workflows that codify best practices, environment promotion strategies that reduce risk, and rollback procedures that restore service quickly when things go wrong.

This is where domain expertise becomes critical. A generic software factory might know how to deploy a web service, but does it understand the specific requirements for hospitality payment processing, guest data privacy, and integration with property management systems?

The Duetto Context

At Duetto, we’re thinking about how a software factory could handle the complexity of hospitality technology. Our domain has unique challenges:

Data Integration Complexity: Every hotel uses different systems with different data formats. A software factory needs to understand these variations and generate appropriate integration code.

Regulatory Requirements: Guest privacy, payment processing, accessibility compliance. The factory needs to embed these requirements into everything it produces.

Performance Characteristics: Revenue management systems need to process pricing updates in near real-time across thousands of rooms and rate plans. The factory needs to optimize for these specific performance patterns.

Operational Constraints: Hotels can’t afford downtime during peak booking periods. Deployment strategies need to account for hospitality business cycles.

We’re not trying to build a general-purpose software factory. We’re exploring how to build one that deeply understands our domain and can produce applications that work reliably in hospitality environments.

The Reality Check

Building a software factory is hard. Not because the technology doesn’t exist—Stripe proves it does—but because the organizational challenges are substantial.

ROI Demonstration: You need to show measurable productivity improvements and cost savings. “The AI is impressive” isn’t sufficient justification for the investment required.

Security and Compliance: Automated code generation that touches customer data or payment systems requires additional security layers and audit capabilities.

Developer Workflow Changes: Your engineering team needs to learn new ways of working. Some will embrace it, others will resist. Change management is as important as the technical implementation.

Quality Assurance Evolution: Your QA processes need to evolve from testing human-written code to validating AI-generated systems. Different failure modes, different testing strategies.

Integration Complexity: Your factory needs to work with existing systems, databases, APIs, and workflows. The harder the integration challenge, the longer the implementation timeline.

These aren’t reasons to avoid building a software factory. They’re reasons to approach the project with realistic expectations and proper preparation.

Looking Forward

The trajectory is clear. Software factories are moving from experimental to mainstream, with proven systems operating at enterprise scale and standardized architecture patterns emerging across the industry.

The question for engineering leaders isn’t whether this transformation will happen. It’s whether your organization will be an early adopter that shapes how software factories work in your domain, or a later adopter that implements patterns developed by others.

At Duetto, we’re betting on being early. Not because we want to be on the cutting edge for its own sake, but because the companies that figure out domain-specific software factories first will have a significant competitive advantage in application development speed and quality.

The software factory represents the next evolution of platform engineering. The organizations that master it will build better software faster than those that don’t.

The challenge isn’t technical anymore. It’s organizational, strategic, and operational.

The question is: Are you ready to build one?

About this analysis: This piece draws from comprehensive research on production software factory implementations, including detailed analysis of Stripe’s Minions architecture, enterprise platform engineering initiatives, and emerging open source solutions. The author is exploring software factory applications for hospitality technology at Duetto.

About the author: Bob Matsuoka is Chief Technology Officer at Duetto and creator of Claude MPM (Multi-agent Project Manager). He has implemented AI-assisted development workflows across enterprise engineering teams and writes about the practical realities of AI integration in software development at HyperDev.

Is The Claude Code Team Moving Too Quickly?

Robert Matsuoka — Mon, 06 Apr 2026 12:30:41 GMT

On March 31, 2026, Anthropic accidentally shipped their entire Claude Code source—512,000 lines of TypeScript—in an npm package. What followed was perhaps the most intense technical autopsy in AI history. The verdict? Mixed, and revealing.

The criticism has been swift and pointed. A 5,594-line file with a single 3,167-line function sporting 12 levels of nesting. Regex-based frustration detection looking for “wtf” and “shit”. A quarter million wasted API calls per day from a three-line bug. As one critic put it: “A multi-billion-dollar AI company is detecting user frustration with a regex.”

But before we pile on, we need to ask: What does “good code” even mean when you’re building client-side LLM applications?

The Unprecedented Challenge

Claude Code isn’t your typical software. It’s a client-side application that orchestrates conversations with large language models, manages context across sessions, and attempts to maintain coherent state while working with fundamentally non-deterministic systems.

This creates problems that traditional software engineering practices weren’t designed for:

Context management: Handling arbitrarily long conversations that exceed model limits
Failure recovery: When your core computation is a 20% failure-rate API call
State synchronization: Keeping UI, conversation history, and model context aligned
Dynamic adaptation: Code that needs to adapt to changing model capabilities

The leaked source reveals sophisticated solutions to these problems: a three-layer memory architecture, anti-distillation mechanisms, dual parser systems for safety. The engineering is ~~genuinely~~ impressive, even if the implementation is sometimes ugly.

The Meta-Problem: AI Writing AI

Claude Code was partially written by Claude Code. This represents the first documented case of a large-scale AI tool generating significant portions of its own source code—not just incremental improvement, but a categorical change in development methodology that creates unprecedented quality control challenges when AI-generated code scales beyond human review capacity.

When AI generates code at scales that exceed human review capacity, traditional quality control breaks down. That 3,167-line function? Probably not written by a human. The 12 levels of nesting? Algorithmic patterns, not human design choices.

This is the real story: We’re witnessing the first major autopsy of self-bootstrapping AI tooling.

Deterministic vs. LLM Code: Different Standards Apply

I’ve been thinking about this distinction a lot lately in my work with Claude MPM, an open-source multi-agent code generation framework built on Claude Code that coordinates specialized AI agents for software development workflows. When you’re building traditional, deterministic software, all the usual rules apply. Clean functions, clear abstractions, maintainable architecture. Use your normal code analysis tools.

But when you’re building LLM-integrated systems, the rules change:

Failure is the default: Your core operations fail 20% of the time
Context is expensive: Every token counts toward limits
Behavior is emergent: The system does things you didn’t explicitly program
Adaptation is constant: Model capabilities change monthly

In this world, a 5,594-line file might be ugly, but if it successfully manages complex failure recovery across multiple conversation threads, it might also be correct.

The Code Analysis Checkpoint Strategy

This is where I’ve found success with my recent updates to the code analyzer in Claude MPM. The analyzer utilizes mcp-vector-search for comprehensive codebase analysis, providing AST-based semantic search, full-text search capabilities, and knowledge graph construction for architectural pattern detection. Instead of trying to prevent AI from generating messy code (impossible), I focus on regular refactoring and analysis checkpoints.

The analyzer has gotten very good at catching two specific issues:

Drift: When AI-generated code slowly diverges from intended architecture
Bloat: When generated solutions become unnecessarily complex over time

I make a point to run these checkpoints regularly, treating them as essential maintenance rather than optional cleanup. It’s like running cargo clippy or eslint, but for AI-generated architectural decisions.

The key insight: AI code needs different kinds of maintenance than human code.

Outcome-Based Generation: Does It Work?

Here’s my perhaps controversial take: If Claude Code successfully helps developers ship better software faster, then the messy internals might not matter as much as we think.

The leaked code reveals a system that:

Handles millions of conversations per day
Maintains context across arbitrarily long sessions
Provides sophisticated memory management
Implements multiple safety layers
Delivers a $2.5 billion ARR product experience

Is the implementation elegant? No. Does it work? Apparently, yes. Because we can observe/measure what it’s building completely independently of what built it.

This doesn’t excuse basic engineering failures (that .npmignore mistake was embarrassing). But it does suggest we need new frameworks for evaluating AI-generated systems.

The Scaffolding Solution

Rather than trying to make AI generate perfect code, we can scaffold around the inevitable messiness:

Automated refactoring checkpoints: Regular cleanup of AI-generated bloat
Architectural constraints: Guard rails that prevent the worst patterns
Outcome validation: Testing that focuses on behavior over implementation
Human oversight: Strategic points where humans validate AI decisions

This is the approach I’ve been taking with Claude MPM, and it’s proven remarkably effective. Let the AI generate messy-but-functional code, then use tooling to clean it up systematically.

What This Means for the Industry

The Claude Code leak represents a watershed moment. It’s our first real look at what happens when AI tools build themselves at scale.

The criticism is valid—basic engineering discipline matters, even in AI systems. A missing .npmignore file is inexcusable for a billion-dollar product.

But the deeper question is whether we’re applying the right standards. Traditional code quality metrics may not capture what actually matters for AI-integrated systems.

Moving Forward

Anthropic probably is moving too quickly in some ways. The leak revealed security vulnerabilities, competitive intelligence losses, and quality control failures that suggest inadequate human oversight.

But they’re also pioneering entirely new categories of software. The problems they’re solving—context management, failure recovery, human-AI collaboration—don’t have established best practices yet.

The real lesson isn’t that AI-generated code is inherently bad. It’s that we need new practices for building, reviewing, and maintaining systems that exceed human comprehension scales.

The question isn’t whether Claude Code’s internals are messy. It’s whether we can build better scaffolding around AI-generated systems to catch the problems that matter while accepting the messiness we can’t avoid.

The Claude Code team probably needs to slow down on the basics—security, testing, deployment hygiene. But they’re moving fast on problems that genuinely require speed to solve before competitors do.

That’s a nuanced position in an industry that loves simple takes. But nuance is what the moment requires.

What do you think? Are we being too hard on AI-generated code, or not hard enough? Share your thoughts in the comments.

About this analysis: This piece draws from extensive technical analysis of the March 31, 2026 Claude Code source leak, including community responses, security assessments, and business impact analysis. The author maintains active development projects using AI-assisted coding tools and has direct experience with the challenges discussed.

Moving Past the 10-Tab Workflow

Robert Matsuoka — Wed, 01 Apr 2026 12:32:10 GMT

From Tab Chaos to Autonomic Orchestration

I’m looking at my (iTerm) terminal right now. Ten tmux sessions. Each session holds a different project context—one monitoring CI failures, another handling a code review, a third debugging a production issue.

This is the reality of modern agent development work.

TL;DR

Multi-session reality: Power users average 8-12 terminal sessions; most work involves modification, bug response, and PR handling—not new code generation
Natural workflow origination: Future systems trigger from product team actions, CI failures, and automated events rather than human prompts
Orchestration evolution: From human-orchestrated agents to orchestration-of-orchestrators where prime coordinators are non-human
Production examples: Stripe’s Minions (1,300 PRs/week), GitLab’s Duo Agent Platform, Meta’s REA demonstrate hierarchical agent orchestration
Architecture shift: Claude Code’s SDK model enables workflow-driven development through persistent, context-aware agent orchestration

The 10-Tab Reality

According to recent developer workflow studies, tmux has become the standard for AI-assisted development, with persistent sessions solving the context-switching tax. The productivity advantage isn’t the multiplexing—it’s the persistence. Projects become environments you step in and out of rather than things you open and close.

But here’s what the productivity tutorials miss: most of those tabs aren’t generating software.

My current session breakdown:

3 sessions: non-coding -- my CTO knowledge base (currently analyzing our Sumo use), a writing assistant, and our Duetto product management framework
4 sessions: coding - various internal tools and MCP connectors
2 sessions: coding - new projects
1 session: code review

The 8:2 ratio holds across most senior developers I’ve observed. Most development work involves responding to existing systems, not creating new ones.

This distribution points toward something significant: the future of development orchestration isn’t human-initiated.

Beyond Prompt-Driven Development

Claude Code’s new SDK architecture reflects this reality. Instead of starting with human prompts, work originates from natural workflow events:

Product team creates ticket → Implementation specification generated
CI pipeline fails → Diagnostic agent analyzes failure, proposes fix
PR submitted → Review agent examines code, suggests improvements
Production alert triggered → Incident response agent investigates, documents findings
Security scan detects vulnerability → Remediation agent generates patch

The pattern: Event → Agent Response → Human Review → Autonomous Resolution.

Humans remain in the loop, but as orchestrators and validators rather than initiators. The shift from “What should I build?” to “How should this system respond?”

Yes, Dall-e is Still Atrocious With Spelling. Don’t @ me!

Orchestration of Orchestrators: Production Examples

Stripe’s Blueprint Architecture

Stripe’s Minions system demonstrates mature orchestration-of-orchestrators. Their “blueprint” pattern alternates between deterministic code nodes and agentic reasoning loops, generating 1,300+ pull requests weekly.

Architecture insight: Each blueprint functions as a strict contract between orchestration and execution. Task definitions specify input requirements, output formats, constraints, and success criteria. The orchestrator manages workflow, agents handle implementation.

Security model: Every Minion execution runs in isolated VMs with no internet or production access. The system has submission authority but not merge authority—all changes require human review.

GitLab’s Intelligent Orchestration

GitLab’s Duo Agent Platform treats agents as durable actors that plan, modify code, fix pipelines, and enforce security with traceability. Multiple AI agents handle parallel tasks—code generation, testing, CI/CD fixes—while developers maintain oversight through defined rules.

Orchestration insight: GitLab positions itself as an AI orchestration plane where humans and agents share delivery responsibility. The platform coordinates multi-agent workflows across the entire software lifecycle rather than providing isolated AI tools.

Meta’s Hierarchical Agent Systems

Meta’s Ranking Engineer Agent (REA) demonstrates autonomous ML lifecycle management. REA Planner and REA Executor components, supported by shared skill and knowledge systems, autonomously evolve ads ranking models at scale.

Acquisition significance: Meta’s $2B Manus acquisition focused on orchestration infrastructure rather than foundation models. Manus’s achievement was engineering an execution layer enabling models to browse, code, manipulate files, and complete multi-step workflows autonomously.

The Architecture Implications

Beyond the Single-Agent Model

The production examples reveal a consistent pattern: successful autonomous development requires hierarchical orchestration rather than monolithic AI assistants.

Traditional approach: Human → Single Agent → Code
Emerging pattern: Event → Orchestrator → Specialized Agents → Validation → Resolution

Context Preservation at Scale

The tmux paradigm of persistent sessions maps directly to agent orchestration. Instead of recreating context for each interaction, systems maintain ongoing project understanding across multiple concurrent workflows.

Implementation insight: iTerm2’s tmux integration (-CC mode) provides the UI pattern for agent orchestration—persistent remote workspaces with native interface feel. The same architecture principles apply to agent coordination.

Where This Leads

Non-Human Prime Orchestrators

The logical endpoint isn’t humans managing multiple agents—it’s orchestrating systems that manage agent ecosystems. According to Gartner’s 2025 Agentic AI research, nearly 50% of surveyed vendors identified AI orchestration as their primary differentiator.

Pattern emergence: Meta-agents or orchestrator-generalists will control specialized agents, assign tasks, interpret results, and revise goals in real-time. Hierarchical orchestration becomes essential for enterprise-scale implementations.

The Developer Role Evolution

Instead of managing 10 terminal sessions, a framework orchestrates autonomous workflows. Each workflow maintains its own context, responds to its own triggers, and escalates to human attention when required. Some of those will be human/experimentation/new development driven, the majority will be responding to the automated lifecycle.

Skills that matter:

Workflow boundary definition: Which autonomous streams can operate independently?
Escalation criteria design: When do workflows require human intervention?
Cross-workflow dependency management: How do autonomous streams coordinate?
Quality gate enforcement: What validation must occur before autonomous resolution?

Implementation Considerations

Teams experimenting with orchestrated autonomous development should consider:

Event-driven architecture: Which existing workflows could trigger autonomous responses?
Context preservation systems: How will agent workflows maintain project understanding?
Isolation and security: What boundaries prevent autonomous agents from causing damage?
Human oversight integration: Where do human validation points occur in autonomous workflows?
Cross-workflow coordination: How do parallel autonomous streams avoid conflicts?

The transition from 10-tab manual orchestration to autonomous lifecycle orchestration isn’t theoretical. Stripe, GitLab, and Meta demonstrate production implementations. The question becomes implementation timeline and organizational readiness.

Early adopters are discovering that the competitive advantage comes not from having the smartest individual AI agents, but from orchestrating networks of specialized agents that collaborate effectively at scale.

Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.

Related reading:

Stripe’s Minions: Inside Their Enterprise AI Coding Agent Strategy — Blueprint orchestration architecture and production metrics
GitLab Duo Agent Platform — Intelligent orchestration across software lifecycle
Tmux Complete Guide: AI-Powered Multi-Agent Workflows — Terminal multiplexing for autonomous development
AI Power Ranking — Tool comparisons and benchmarks for AI practitioners
LinkedIn Newsletter — Strategic AI insights for CTOs and engineering leaders

Is This The Era of the Connector?

Robert Matsuoka — Thu, 26 Mar 2026 12:32:38 GMT

TL;DR

Users consolidate around 4 core platforms (Slack, Notion, Email+Office, AI Tool) while rejecting standalone/SASS tools
Connectors that bring data to users beat new tools that require visits
Infrastructure breakthrough: Slack manifests + MCP protocol + LLM services make org-specific connectors trivial
Democratization effect: Bootcamp engineers can now build sophisticated integrations that once required senior developers
Production evidence: 3 connectors (4-6 hours each, $1-2.5K in AI tokens) replaced 5-6 standalone tools that would cost $150-300K+ traditionally
Tolerance for “broad-based general tools” declining — UX mindshare captures traffic even when APIs do the work

In the past two weeks, I’ve built three connectors that collectively replaced what would have been five or six standalone tools.

Engineering Search Connector: Hosted semantic and knowledge graph search service built by repurposing mcp-vector-search. Unified search across 150+ GitHub repos, 1,700+ wiki pages, and ticket systems. Accessible through Slack bot, web interface, CLI, and MCP connector for Claude.AI that brings engineering knowledge to where people already work.

CRM Data Connector: Live customer data piped directly into Claude.AI sessions via MCP. No dashboard to check, no reports to generate. Ask “What’s our pipeline this quarter?” and get live data in 1-3 seconds.

Document Workflow Connector: Artifact browser and guided PR workflow for non-technical contributors. Product managers can explore and propose changes to structured docs without touching git or learning new interfaces.

None of these required users to adopt a new primary tool. Each brings specialized functionality to platforms they already inhabit daily. And each took roughly 4-6 hours to build (plus agent time).

This isn’t a productivity humble-brag. It’s evidence of a fundamental shift in how organizations interact with their data. We’re entering the connector era — building bridges between specialized intelligence and the handful of platforms where users actually live, rather than standalone applications they have to visit.

The numbers support this pattern. Users toggle between apps 1,200 times daily, losing 40% productivity to context switching. Connector ecosystems are exploding: Slack’s marketplace hosts 2,600+ apps with 550K+ daily custom integrations. The MCP protocol went from 100K to 8M downloads in six months — unprecedented adoption for plumbing infrastructure.

The question isn’t whether Slack, Notion, and Claude.AI will survive the AI wave. It’s whether the hundreds of specialized tools competing for attention understand that the game has changed. Users have less tolerance for broad-based general tools than they once did. The platforms that capture UX mindshare will get most of the traffic, even if APIs and agents do the actual work behind the scenes.

The evidence is clear from user behavior: they don’t want to learn a new search interface, remember another login, or context-switch to yet another tab. They want the intelligence layer to meet them where they already are.

The Source of Truth Problem

Most organizations have a source-of-truth problem they haven’t fully articulated. They have Slack for real-time communication. They have Notion or Confluence for documentation. They have Google Docs for drafts that become documents that become outdated that stay around anyway. They have JIRA for tickets that may or may not reflect what was actually decided. They call this a “knowledge management system.” It’s more accurately a distributed archive of partially-intentional artifacts with no clear authority hierarchy.

The question “who owns this decision?” leads to a Slack thread from eight months ago, a Notion page that three people edited and nobody is certain is current, and a Google Doc someone linked in a comment that requires permission to access. This is the status quo. It functions, after a fashion, because humans are good at triangulating across ambiguous sources and asking colleagues to fill gaps.

AI agents are not good at this. They will confidently synthesize the eight-month-old Slack thread with the outdated Notion page and present the result as a coherent answer. The errors won’t be obvious. They’ll be subtly wrong in ways that require domain expertise to catch.

The source of truth problem was always real. It was manageable when every query ran through a human brain. It becomes actively dangerous when queries run through an inference layer first.

What you actually need — what organizations are starting to build — is a repository where the data structure enforces truth. Not a place where the right answer might be findable if you look hard enough. A place where the structure of the data makes the wrong answer harder to produce.

But here’s the connector insight: that structured repository doesn’t need to be where users spend their time. It can be the authoritative backend that feeds connectors in the platforms users already inhabit.

Where Users Actually Live

User attention has consolidated around four core platforms:

Slack: Real-time coordination, team presence, ephemeral decisions. 32.3 million daily active users with 550K+ custom integrations daily.

Email + Office Suite: Formal communication, document collaboration, external stakeholder interface. Microsoft reports 400M+ Office 365 commercial users.

Notion: Knowledge management, project tracking, collaborative documentation. 100M+ users consolidating entire productivity stacks.

Claude.AI: AI assistance, analysis, content generation. Rapidly becoming the default interface for LLM interactions across knowledge work.

Each platform serves a legitimate core function. Tool builders make the mistake of assuming they can compete for primary platform status by building something better. Users are done adopting new primary platforms. They’re consolidating around tools that already have their attention.

The pattern reveals a deeper truth: people live in transactional systems, not knowledge systems. Slack is where decisions happen. Email is where approvals flow. Claude.AI is where analysis gets done. These are transactional - work happens there daily.

Confluence is a perfectly good wiki tool. But it’s knowledge-at-rest, not transactional. People don’t live there. They visit when forced to document something, then return to their transactional workflows. The knowledge gets stale because maintenance happens in a different system than usage. (Notion manages to straddle the line between knowledge at rest and transactional)

Integration platforms like Zapier understand this - they connect 8,000+ apps with 3.4M+ business users by bringing specialized functionality to existing workflows rather than creating new destinations.

Users just want the data, dammit. They don’t want to learn your interface.

The Connector Infrastructure Moment

What changed? Three pieces of infrastructure matured simultaneously:

Slack Manifest Tool makes organization-specific bots trivial to build. The manifest.yaml format standardizes permissions, scopes, and deployment. Weeks of OAuth wrestling became hours of configuration.

MCP Protocol achieved “USB-C for AI” universal connectivity. Claude.AI, ChatGPT, and dozens of platforms support the same connector format. Build once, deploy everywhere. The 100K to 8M download growth in six months reflects pent-up demand.

LLM Services like Bedrock and OpenRouter provide natural language interfaces that make connectors intelligent rather than just data pipes. Ask questions in plain English, get structured responses, maintain conversation context.

Semantic Search Infrastructure like mcp-vector-search can be repurposed as hosted services, adding intelligence layers that understand meaning rather than just matching keywords. This transforms basic data access into contextual knowledge retrieval — a crucial enabler for connectors that need to surface relevant information rather than exact matches.

Combined, you can build a production connector in a single afternoon. Slack manifest defines the bot interface. MCP schema defines the data sources. Semantic search handles intelligent retrieval. Bedrock provides the language understanding. Deploy to AWS Lambda and you’re live.

My three connectors follow this exact pattern. The engineering search connector repurposes mcp-vector-search as a hosted service with all-MiniLM-L6-v2 embeddings for semantic and knowledge graph search, but the user interface is just Slack commands and Claude.AI MCP tools. The CRM data connector is a headless AWS service that makes customer data available through natural language queries in Claude.AI. The document workflow connector provides git workflows through a web UI that non-technical users can navigate.

Each connector took 4-6 hours to build. Each would have taken 4-6 months to build as a standalone application with user management, authentication, interface design, mobile responsiveness, and all the infrastructure a “real app” requires.

The Democratization Effect: The infrastructure shift goes beyond development speed — it’s democratizing who can build sophisticated integrations. What once required senior engineers with deep API knowledge can now be handled by bootcamp graduates following established patterns. I built these first three connectors to validate the approach, but similar projects will go to junior engineers going forward.

This changes resource allocation fundamentally. Organizations can solve integration problems without burning senior engineering cycles on “plumbing” work. Information that was once very hard to obtain is now trivial to access.

The economics are compelling. Building three production connectors cost roughly $1,000-2,500 in AI tokens over 44 days. Traditional contractor development for equivalent functionality would have run $150-300K+. The connector approach isn’t just faster — it’s 100x more cost-effective.

The adoption metrics prove the value. The CRM connector launched March 18th with 23 invocations on day one. No formal rollout, no training sessions, no onboarding docs. Just organic discovery across a 300+ person company. By week two, daily usage tripled to 95 invocations per day. Tuesday hit 152 invocations — including a 40-query analysis session in a single hour. That’s 299 queries in 7 days with zero errors, from a connector that took 4-6 hours to build.

The era isn’t about choosing between platforms. It’s about connecting specialized intelligence to the platforms users have already chosen.

Why Wikis Can’t Compete in the Connector World

Traditional knowledge management tools face a structural mismatch in connector architecture. Wikis assume users will “go to the tool” for information. Connectors flip that assumption: the tool comes to the user.

This creates specific problems:

The Authoring/Retrieval Tension: Wikis optimize for collaborative authoring — anybody can edit, flexible structure, link everything, evolve over time. This is the opposite of what retrieval needs: consistent schema, clear ownership, explicit governance. When you pipe wiki content through a connector, you inherit all the inconsistencies that collaborative authoring creates.

Search Architecture Limitations: Confluence’s search is notoriously bad because it does keyword matching on unstructured text. This was problematic before LLMs. With LLM-powered connectors, it becomes worse because the AI layer adds confidence to bad retrieval results. Users get wrong answers delivered with conviction.

Static Data Problem: Notion’s AI operates on static content snapshots, disconnected from real-time operational state. When CRM connectors query “What’s our pipeline this quarter?” through a Notion connector, it’s answering based on what someone wrote about the pipeline, not live customer data. The connector amplifies the staleness problem.

Governance at Scale: Wiki governance defaults to “community-maintained,” which means in practice nobody is responsible for accuracy. As organizations scale, wikis accumulate pages nobody knows are outdated. Connectors don’t solve this — they accelerate the distribution of stale information.

Our structured document framework represents the alternative: git-backed Markdown with schema-validated YAML frontmatter. Every document has explicit metadata: owner, status, domain, confidence, time_box. The structure is the feature. When document workflow connectors expose this, the schema ensures consistent data quality regardless of interface.

Structured document repositories outperform wikis for AI query by 35-60% in controlled tests. Clean Markdown with explicit metadata reduces token usage by 20-30% and improves retrieval accuracy significantly. This isn’t philosophical — it’s measurable.

Wikis remain useful for collaborative drafting and evolving reference material. But they’re not the right backend for connector architecture. The connector era requires structured data sources that can maintain quality across multiple interface layers.

Building Connectors vs. Standalone Tools

The strategic choice organizations face isn’t “which tool should we build?” but “should we build a tool or a connector?” My experience with three connectors illuminates the trade-offs:

Engineering search could have been a standalone search platform. Instead, it’s accessible through Slack commands, CLI tools, web interface for visualizations, and MCP tools for Claude.AI sessions. Same search capability, four different interaction models depending on user context.

CRM integration could have been a dashboard with charts and filters. Instead, it’s a headless MCP service that makes customer data available through natural language in Claude.AI. Ask “Show me deals over $100K in our target vertical” and get live results in 1-3 seconds. No dashboard to learn, no visual interface to maintain.

Document workflows could have been a product management SaaS platform. Instead, it’s a guided workflow that helps non-technical contributors interact with existing git-backed document frameworks. Browse artifacts, generate AI summaries, submit PRs — all through interfaces that match users’ technical comfort levels.

Specialized intelligence delivered through platforms users already inhabit. The connector approach wins on several fronts.

Development time: 4-6 hours vs. months. No user management, authentication, responsive design, or mobile apps to build.

Adoption friction: Zero onboarding. No new logins, training sessions, or change management overhead.

Maintenance burden: Focus on data logic and intelligence, not interface maintenance across device types and browser versions.

Integration: Connectors compose naturally with existing workflows. Slack discussions can include live Salesforce data. Claude.AI analysis can pull from engineering knowledge graphs. Standalone tools require export/import workflows.

The business case is compelling: connector development costs 10-20% of standalone application development while achieving 3-4x higher user engagement.

The implications go beyond development efficiency. Users have less tolerance for “broad-based general tools” than they once did. Managing dozens of application contexts creates unsustainable cognitive load. Platforms that capture daily attention get most of the traffic, even when APIs and agents do the computational work behind the scenes.

This creates different winner-take-all dynamics. The winners aren’t necessarily the best tools. They’re the platforms users choose to inhabit, plus the connectors that bring specialized capability to those platforms.

What This Means for Your Stack

The connector era doesn’t eliminate existing tools — it clarifies their appropriate roles and challenges their assumptions about user attention.

Slack keeps its coordination function: Real-time presence, threading, ephemeral decisions. But it becomes a command interface for structured data sources rather than a knowledge repository itself.

Notion retains collaborative authoring value: Drafting, evolving documentation, reference material. But it stops being the “source of truth” for operational decisions. That role shifts to structured backends accessible through Notion connectors.

Specialized tools survive by becoming intelligent backends: Your CRM, your monitoring system, your code repositories — these maintain their core data authority. But user interaction shifts to connector layers in platforms where users already work.

The question to ask about any tool: Is this where I want an AI agent pointing when it needs authoritative information? If the answer is no, it’s not your source of truth. It might still be valuable — as a backend, as a collaborative space, as a specialized interface for expert users. But it doesn’t earn the designation of “primary platform.”

The organizational challenge: Getting non-technical teams comfortable with structured data workflows is real change management. Document workflow connectors address this by providing guided interfaces for git-backed workflows. But someone still needs to own schema design and governance processes.

Who should build connectors first: Engineering-adjacent teams with strong PM-engineering collaboration. Organizations where AI hallucination on operational decisions creates measurable cost. Companies that have already felt the pain of distributed knowledge management.

Timing matters: Most organizations haven’t built connector strategies yet. Companies that establish structured knowledge backends with connector frontends in 2026 will have 12-18 months of advantage when AI-mediated query becomes standard practice.

The connector era isn’t about choosing between platforms. It’s about connecting intelligent backends to platforms users have already chosen. Organizations that get this right will operate with less context switching and faster access to operational data.

Users just want the data, dammit. The question is: will you bring it to them, or keep expecting them to come to you?

Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.

Related reading:

AI Power Ranking — Tool comparisons and benchmarks for AI practitioners
LinkedIn Newsletter — Strategic AI insights for CTOs and engineering leaders

One Year of HyperDev

Robert Matsuoka — Fri, 20 Mar 2026 12:31:14 GMT

A retrospective on building in public during the great AI coding transformation

One year ago, “50 First Dates with Claude Code“ took me several hours to write with Joanie’s help in Google Docs. This morning, I drafted two comprehensive Anthropic articles using claude-mpm agents in 45 minutes - research, generation, editing, the works.

That evolution mirrors exactly what happened across the entire AI coding industry this year (and the ascension of Claude Code, which was released shortly before I began my journey). I wasn’t just writing about the transformation. I was living it.

The Journey: From Accident to Infrastructure

HyperDev started by accident. In March 2025, I posted a LinkedIn experiment about spending 12 hours building a travel planning app with AI tools. I was a technology executive who “hadn’t coded seriously in 20 years” testing whether the productivity claims were real or hype.

The response was immediate and intense. Richard Wang left a prescient comment: “’AI allows a non-engineer to build a product without coding’ is hype... ‘AI can improve a developer’s productivity by 10x’ is true.” That experiment generated 4,000 lines of AI code in a single session and launched what became 168 articles over nine months.

April through June was chaos. I built claude-multiagent-pm, a prototype that worked well enough to be exciting and poorly enough to be frustrating. Token costs were obscene - every subprocess inherited the entire conversation context. I shipped 44 repositories. Probably a third represent false starts or abandoned approaches.

But that’s what I was learning: what not to build. Which constraints matter. Where the sharp edges live. The breakthrough insight from this period: Infrastructure beats features. The tools that demo well (flashy autocomplete, pretty interfaces) weren’t the ones that sustained daily use. The protocols, memory systems, and context management layers were what made sustained multi-agent work possible.

The Breakthrough: When Everything Changed

Mid-July 2025, Claude Code shipped context filtering. Sounds like a minor technical detail. It changed everything.

Before: my prototype burned tokens like a furnace and required constant babysitting.
After: I rebuilt everything. claude-mpm emerged with 1,545 commits over the rest of the year.

I remember the specific moment it hit me: I’d just pushed a feature that I would never have taken on without a team. Four hours of engaged time, a few days of agentic time. Twenty years away from serious coding. Four months back. Contributing production code for paying clients.

The tools weren’t just making me more productive. They were making me more ambitious. I was taking on problems that required sustained, complex thinking because I had AI teammates that could handle the execution details while I focused on architecture and strategy.

This is when my perspective shifted from “AI coding tools are interesting” to “AI coding tools are transformative.” Not because they eliminated the need for programming knowledge, but because they amplified existing knowledge into production-quality artifacts.

The writing and building formed a virtuous cycle. I documented what I learned building tools. The documentation attracted practitioners who used the tools. Their feedback improved the tools. claude-mpm gained 30+ stars and daily use across six months of client work. These weren’t GitHub tourism projects - they were tools that other practitioners adopted because they solved real problems I’d discovered through real use.

The Evidence: Predictions and Numbers

Looking back through a year of articles, my prediction accuracy was surprisingly good.

What I got right:

Multi-agent orchestration would prove superior to monolithic assistants (claude-mpm’s adoption validated this)
Infrastructure over features as the determining factor for tool longevity (memory systems outlasted flashy demos)
CLI-agentic coding going mainstream (Claude Code’s 46% “most loved” rating proved the thesis)
Pricing correction timing (18-24 months from October 2025 - signals are clear at six months)

What surprised even me:

Speed of Claude Code’s dominance (faster than even advocates expected)
Writing-building credibility loop (thought leadership through shipping, not just analysis)

The quantified impact tells its own story:

4,919 commits in nine months of sustained development
69.7 billion tokens processed across all tools and projects
198 articles published (3.2/week sustained)
547 production deployments across client and personal projects
$45,000 in AI compute at rack rates (subsidized to $8,000)

But the qualitative transformation matters more. I evolved from asking “Is this real?” to asking “How do we scale this organizationally?”

The Transformation: From Observer to Practitioner to Leader

The most significant development was personal: in January 2026, I joined Duetto as CTO.

This wasn’t a career change - it was an expansion. Twenty-five years of technology leadership combined with eight months of hands-on AI development created a unique perspective. I wasn’t returning to technology leadership despite the AI work. I was taking the role because of it.

What changed wasn’t my capabilities - I’ve been leading technology teams for decades. What changed was having AI as a force multiplier that converted that knowledge into actual artifacts. I could prototype entire systems, validate approaches, and demonstrate concepts that previously would have required dedicated engineering resources.

The leadership experience informed architectural decisions. The practitioner activity created credibility. The writing documented both. Now I get to test everything I’ve been writing about at enterprise scale.

The Reality Check: Both/And, Not Either/Or

October 2025, I published “Is AI A Bubble? I Didn’t Think So Until I Heard Of SDD.” The piece synthesized something I’d been wrestling with: how can genuine transformation and bubble dynamics exist simultaneously?

The answer: they can. And do.

The AI coding revolution is real. The bubble dynamics are also real. Codeium at 70x ARR multiples (vs dot-com peak of 18x) while providing genuine value to practitioners. My $45,000 in AI compute costs exemplified both the unsustainable economics and the genuine value creation. The 82% subsidy rate can’t last, but the ROI still works at full rates for sustained professional use.

Companies with product-market fit and operational discipline will survive the correction. Those burning capital on “technical potential” without user adoption won’t. The technology remains transformative even if the valuations prove unsustainable.

What’s Next: From Individual Productivity to Organizational Transformation

The industry is transitioning from individual productivity tools to enterprise transformation frameworks. The early adopters who mastered AI-assisted development workflows now face a different challenge: scaling those practices across entire engineering organizations.

The questions have evolved:

Year 1: “Can AI tools make me more productive?”
Year 2: “How do we maintain code quality and security with AI-generated code?”
Year 3: “How do we transform hiring, onboarding, and career development when AI changes what programming means?”

At Duetto, I’m working on these questions at scale. How do you migrate an enterprise engineering team from traditional development practices to AI-assisted workflows while maintaining operational excellence? How do you balance productivity gains with governance requirements? How do you rethink technical leadership when junior developers can ship senior-quality code with AI assistance?

These are the infrastructure problems that matter now. Not the tools themselves, but the organizational systems that make the tools effective at scale.

my “Green Wall”

The Retrospective: What One Year Taught Me

HyperDev became valuable because it documented the transformation in real-time, from the perspective of someone living it. Not retrospective analysis of what happened, but contemporary documentation of what was happening.

The key insights:

Timing matters. Starting documentation right at the inflection point captured both the chaos and the consolidation. Personal journey paralleled industry maturation.

Practitioner perspective beats observer perspective. Direct experience with the tools, including their limitations and sharp edges, generated insights that pure analysis couldn’t match.

Building creates credibility. Shipping tools that other practitioners adopt generates more authority than analytical commentary alone.

Writing and building amplify each other. Documentation of practice creates thought leadership. Thought leadership creates opportunities. Opportunities create more practice to document.

One year ago, I was asking whether AI coding tools were genuinely transformative or just sophisticated autocomplete. The answer: they’re genuinely transformative, but in ways that none of us fully anticipated. The transformation wasn’t about eliminating the need for programming knowledge. It was about amplifying existing knowledge into production-quality artifacts faster than previously possible.

Most importantly, the transformation was about enabling individual practitioners to think and build at organizational scale. That’s what I experienced personally. That’s what I documented in nearly 200 articles. And that’s what I’m now implementing at enterprise scale.

HyperDev year one was about learning what was possible. Year two is about making it practical. Year three might be about making it inevitable.

The tools keep getting better. The workflows keep evolving. The organizational challenges keep getting more complex.

And I’m still here, still building, still documenting, now leading, still learning.

What a year it’s been. What a year it’s going to be.

HyperDev documents the real-world application of AI development tools by practitioners building production systems. For technical deep dives and business analysis of the tools behind this transformation, look for upcoming coverage of Claude Code’s competitive dominance.

Everyone Blamed Clawd Bot’s Execution. The Concept Was the Problem.

Robert Matsuoka — Thu, 12 Mar 2026 11:32:09 GMT

The story everyone told about Clawd Bot missed the point entirely. Austrian developer Peter Steinberger built an open-source AI assistant that went viral — 145,000 GitHub stars, 2 million visitors in a week. Then Anthropic forced a trademark-based name change because “Clawd” was too similar to “Claude.” The community called it petty. DHH called Anthropic “customer hostile.” The irony: Clawd Bot users were actually buying more Claude subscriptions, providing free marketing to Anthropic, yet they still demanded the shutdown.

But everyone focused on the wrong drama. The trademark dispute was noise. The real problem was deeper: Clawd Bot was built because someone could, not because anyone needed it.

I tested Clawd Bot for about a week. The interface was clean, the onboarding smooth, the responses capable. But it required permissions I wouldn’t give to any tool — access to email, calendars, messaging, sensitive services. The execution had real problems. But even if those were fixed, it would still be solving the wrong problem.

Here’s where I should admit: I tried building a digital assistant, izzie, when I started experimenting with AI agents. I never got it to a point I found useful. Not because of technical limitations — because the entire concept of a universal assistant doesn’t match how work actually happens.

TL;DR

Clawd Bot was successful open-source project by Peter Steinberger that Anthropic forced to rename; execution wasn’t the problem
The real question: when do you need an “assistant”? Most execs won’t trust AI scheduling; the value is intelligent data movement between services
Context switching is a symptom, not the root issue — the issue is what assistants should be doing at all
The product management sessions: Granola meeting notes, calendar checks, Slack updates, Notion sync — all from within one tool, data flowing intelligently between services
The commercial evidence: Cursor, Notion AI, Linear’s AI triage — the winners embedded AI in tools as infrastructure, not interface
trusty-izzie’s highest value isn’t the chat interface — it’s as a local MCP service exporting personal context to every other tool
The universal assistant category isn’t going to produce a winner. It’s going to dissolve.

What the Universal Assistant Model Gets Wrong (And It’s Not Just Execution)

Clawd Bot had serious execution problems — it’s a security nightmare requiring broad permissions across email, calendars, messaging platforms, and sensitive services. You can’t ignore that. But even if the security issues were solved, universal assistants face a deeper structural problem: they assume people need an assistant in the traditional sense.

Walk through what even a well-executed version of the same product model looks like.

Smooth onboarding. Crystal-clear use cases. High-quality AI responses. Clean interface design. Users know exactly what to ask and how to ask it.

You still have to leave whatever you’re working on to use it. And when you do, the context you were carrying — the code you were reviewing, the initiative you were drafting, the design decision you were working through — is no longer present. You’ve moved somewhere that knows nothing about any of that.

So you explain. “I’m working on the ‘YYY’ data ingestion initiative, and I need to check whether the points Mark raised in Tuesday’s meeting are addressed in the current design.” The assistant doesn’t know what ‘YYY’ is. Doesn’t have Tuesday’s meeting. Doesn’t know Mark, the current design, or the organizational context that makes “addressed” mean something specific. You load all of it by hand.

In demos, this overhead is invisible. Demo tasks are self-contained by design — the context fits in a sentence or two. In practice, your working context isn’t self-contained. It’s weeks of accumulated decisions, relationships, dependencies, and constraints that live distributed across your tools. You can’t paste it into a chat window. You can’t even fully articulate it. It’s partially tacit, partially in documents, partially in the history of the tool you’re using.

The Session That Clarified It

A few weeks into the new role at Duetto, I was doing product management work in a claude-mpm session — reviewing open initiatives, managing the PR queue, creating proposals. Standard operational work for a new CTO getting oriented.

I wanted to add an infrastructure initiative. Cloud Dev Sleds — dedicated cloud development machines for the engineering team. The context was in a meeting I’d had the day before. In the old workflow, this would mean: switch to Granola, find the right meeting, read the transcript, extract the relevant points, switch back, and then write the initiative with that context now loaded in my head rather than in the tool.

Instead I just asked: “Review my meeting with Mark yesterday in Granola to get context. I want to create the initiative as a feasibility, cost, and LOE assessment.”

The tool pulled the notes. I created the initiative. The product context — what other infrastructure work was in flight, what the team structure looked like, what the related architectural decisions were — never left. The Granola content landed inside that context rather than requiring me to carry it manually between tools.

Same session: needed to check whether I had a conflict for an upcoming demo. Calendar check, without opening Google Calendar.

Same session: the team needed a status update. Posted directly to the engineering Slack channel, with proper <@USERID> mentions so people actually got notified. The message reflected the same initiatives I’d been working on all session — not because I copy-pasted anything, but because the tool already knew what was in flight.

Later: set up a Notion sync — initiative statuses with links to the docs, updated automatically.

The efficiency argument is real but secondary. The more important thing is that the product context never left. The tool knew what initiatives existed, who owned what, what the architectural decisions were, which PRs were waiting on which engineers. When I pulled Granola notes, they arrived inside that context. When I posted to Slack, the message was informed by that context. A universal assistant would have required me to reconstruct and transport that context manually every time I needed to cross a tool boundary.

No universal assistant is going to have that work knowledge. Not because the AI isn’t capable. Because the knowledge lives in the tool, accumulated over months — PRDs, design decisions, initiative history, team assignments, the proposals that got approved and the ones that didn’t. You don’t recreate that in a chat window.

The Deep Context Problem

The thing that makes domain tools irreplaceable isn’t AI capability. It’s accumulated context.

A product management tool carries months of initiative history. The CTO knowledge base carries organizational decisions, vendor relationships, strategic context that builds over time. These aren’t things you can summarize in a system prompt. They’re queryable, interconnected, grounded in real artifacts. The tool has developed something like institutional memory — and that memory is what makes AI assistance inside the tool qualitatively different from AI assistance outside it.

Universal assistants are built for breadth. Any question, any domain, any task. That breadth is the pitch and also the structural weakness. The model that’s ready for anything is primed for nothing specifically. It has no idea that “the YYY initiative” refers to a specific ingestion redesign with a particular set of constraints, a particular set of people involved, and three months of design decisions behind it.

The inversion worth stating plainly: the tools you work in every day already have more relevant context than any assistant will. The right move is surfacing AI capabilities inside those tools, not pulling people out of those tools into a separate assistant layer.

But here’s what’s happening at the executive level. I’m finding more and more technical executives using Claude Code as knowledge assistance — not because they’re universal assistants, but because the amount of data and complexity they can manage far exceeds what standard off-the-shelf tools provide. The deep context problem can’t be solved with generic solutions.

For MPM, I built specific connectors: gworkspace-mcp, slack-mpm, notion-mpm, granola-mcp (the last from Granola, the others myself because mcp has limitations). That became as much of an “assistant” as I needed, besides izzie. No universal chat interface. Just targeted data bridges that let Claude access specific services when I’m working on something that needs their context.

The commercial evidence points the same direction. The AI tooling products with real adoption aren’t universal assistants. Cursor put AI in the editor. Notion AI put AI in the documents. Linear’s triage put AI in the issue tracker. Each works because the AI operates inside existing context. The pattern is consistent enough that it’s probably not coincidence.

Interface vs Infrastrccture

What I Got Wrong About My Own Bot

I (re)built trusty-izzie as a personal assistant — natural language queries over my email and calendar history, local graph database, vector embeddings, stays on my machine. It works. But “personal assistant” was the wrong frame for where the value lies.

The thing izzie has is a grounded, real-time, locally-stored representation of my professional life — people, relationships, projects, scheduling, communications history. That’s a context store. Every tool I use should have access to it without me switching to izzie to ask.

The right version of izzie isn’t the one you talk to. It’s the one that runs as a local MCP service — always on, queryable by anything that needs personal context. The product management tool asks it about scheduling. The writing environment surfaces relevant prior conversations. The coding environment knows who owns what system before I have to explain it. None of that requires me to open izzie. It requires izzie to be infrastructure rather than interface.

If you want to try izzie yourself: izzie.bot has the details, and the full source is at github.com/bobmatnyc/trusty-izzie. I strongly recommend building from source using an agentic coder to verify the code is safe — never trust AI tooling with your personal data without auditing it first.

Not there yet. But the frame shift changes what to build next.

What the Architecture Looks Like

If you’re building a personal AI tool, the question isn’t “what will users ask the assistant?” It’s “where do users have context, and how do you bring assistance there without making them leave?”

The test is simple. Does using your tool require leaving the context where the relevant information lives? If yes, you’re fighting the architecture. Users will use it occasionally, for low-friction tasks. They won’t build their workflow around it.

The tools that pass the test: Claude Code (your codebase is the context), Cursor (you stay in the editor), Notion AI (you stay in the document), Linear AI triage (you stay in the issue tracker). The tools that fail it: every standalone AI assistant that requires opening a new interface and re-explaining what you’re working on.

For domain tools with real depth — months of accumulated decisions, relationships, history — the connectors are the product. The LLM orchestration is the interface layer. The accumulated context is what no competitor can replicate by building a better general assistant. The moat isn’t the AI. It’s what the AI is operating inside.

For personal infrastructure like izzie: build the MCP service before the chat UI. The chat UI is useful and I use it. The MCP service is what makes the tool true infrastructure rather than one more thing to switch to.

The universal assistant category isn’t going to produce a winner because the category is structured wrong. The capabilities will get absorbed by the tools where the relevant context lives — because that’s where the value is, and users will figure that out even if product teams don’t. The infrastructure driving this — entity and relationship detection, email, calendar, and task management (all built for Izzie) — will likely be delivered by the personal productivity tool providers (hello Google).

Clawd Bot wasn’t a failed product. It was wildly popular, but I suspect will have been a flash in the pan once the shininess wears off and the liabilities outweigh the usefulness. That distinction matters, because if you think it’s an execution problem, you go looking for a better universal assistant. If you understand it’s a conceptual problem — that most “assistant” work is intelligent data movement — you build infrastructure instead of interfaces.

Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.

Related reading:

What Does A Pattern Master Do? — The role of expertise in AI development
AI Power Ranking — Tool comparisons and benchmarks for AI practitioners
LinkedIn Newsletter — Strategic AI insights for CTOs and engineering leaders

MCP Was a Brilliant Idea — But It Needs a Proper API Behind It

Robert Matsuoka — Tue, 10 Mar 2026 11:30:57 GMT

The pattern shows up constantly when I look at MCP server implementations. Someone discovers the protocol, gets excited about giving agents tool access, builds a server in a weekend, ships it to the registry. Six tools. Maybe eight. Each one is basically a direct passthrough to whatever SDK the underlying service provides.

And for about a week, it feels like it works.

Then the agent needs to do something real. Archive 400 emails from a specific sender. Pull all calendar events for Q1, cross-reference them with a project timeline, and generate a summary. Move a batch of files across Drive folders. The agent starts calling tools in sequence, hits rate limits, gets confused about pagination, makes the same API call twelve times trying to work around a 50-item response limit that the MCP tool never exposed as a parameter. Eventually it either fails or produces something partially wrong, and nobody’s quite sure where the breakdown happened.

The bottleneck isn’t MCP. The protocol did exactly what it was supposed to do — it gave the agent a clean interface for calling tools. The bottleneck is what’s behind the MCP server.

I’ve built a lot of these now. gworkspace-mcp has 115 tools across Gmail, Calendar, Drive, Docs, Sheets, Slides, and Tasks. slack-mpm has 40+ tools plus a full async Python API library underneath that can run entirely without an agent in the loop. The gap between those projects and most of the reference MCP servers I’ve seen is not complexity — it’s architecture. Specifically: whether there’s a real API underneath the MCP layer, or whether the MCP tools ARE the implementation.

That distinction matters more than almost anything else when you’re building tools that agents will actually use in production.

TL;DR

MCP servers built as thin wrappers over service SDKs hit hard ceilings when agents need to do anything at volume or across operations
The reference Slack MCP server has 8 tools; a production implementation needs 40+, with a real API library underneath it
The three-layer pattern (API → MCP → Skills) has a specific job at each layer — remove any one and the system degrades in a predictable way
Tool description quality is the single biggest lever on agent behavior; bad descriptions produce bad decisions regardless of what’s underneath
Thin wrappers are fine for prototypes and read-light tools; the inflection point is when you want to write a script that does what the agent does

The Official Slack MCP Server Problem

The reference Slack MCP server — currently maintained by Zencoder after leaving the official MCP registry — offers eight tools. List channels. Post a message. Reply to a thread. Add a reaction. Get channel history. Get thread replies. Get users. Get a user profile. That’s it.

For a demo, that’s fine. For anything agents actually need to do with Slack, it hits walls quickly.

My slack-mpm server covers 40+ tools: search messages by date range and keyword, manage bookmarks, set reminders, handle scheduled messages, list workspace members with filtering, manage file uploads, archive channels. The implementation underneath is a clean async Python API — 47 functions across eight modules — that you can call directly from scripts without an agent in the loop at all.

The functional gap is obvious enough. What’s less obvious is why it exists.

The reference server isn’t thin because Slack’s API is thin. Slack’s API is extensive. The server is thin because it was built without a real API library underneath it. The MCP tools are the implementation — there’s no abstraction layer, no pagination handling, no rate limit management, no batch operation support. Each tool calls the Slack SDK directly and returns the result.

That works for eight tools. It doesn’t scale to forty because the complexity you’re hiding from the agent — auth edge cases, cursor-based pagination, retry logic on rate limits, handling the difference between bot tokens and user tokens — has nowhere to live. There’s no library to put it in. So you either skip the complex operations entirely, or you dump that complexity into the MCP tool handler itself, which makes the tool fragile and hard to maintain.

The reference server chose the first option. Which is reasonable for a reference — but it means agents using it can’t search Slack properly (search requires user tokens; the server only supports bot tokens), can’t do bulk operations, can’t run scheduled tasks, can’t be used programmatically outside of an agent context.

This isn’t a knock on the people who built it. It’s a knock on the pattern of treating MCP as the architecture rather than the interface.

Phil Schmid made a similar observation in January in a piece called MCP is Not the Problem, It’s Your Server: “MCP servers are not thin wrappers around your existing API. A good REST API is not a good MCP server.” Correct, but it doesn’t go quite far enough. The problem isn’t just that REST APIs make bad MCP servers — it’s that MCP servers without any abstraction layer underneath them make bad tools, regardless of what the underlying service looks like.

What a Real API Gives You

When I started building gworkspace-mcp, I made a decision early that turned out to be foundational: build the Google Workspace API library first, then write the MCP server as a thin interface on top of it. The API library handles auth, pagination, rate limiting, and error normalization. The MCP tools are mostly one-liners that call the right API function and return the result.

That decision shows up in five specific ways.

Pagination at the library level. Gmail’s API returns 50 messages per page by default. If an agent wants to archive everything from a specific sender over the past six months, that might be 400 messages across eight API calls. If the MCP tool handles pagination — which means the API library handles it — the agent calls one tool and gets back 400 message IDs. If pagination isn’t handled, the agent manages cursor iteration itself: call the tool, get 50 results, extract the cursor, call again, repeat. Agents do this badly. They lose track of cursors, make redundant calls, or give up after the first page and tell you they found the 50 most recent messages.

Rate limit management that doesn’t leak up. Rate limits handled in the API layer are invisible to the agent. The tool call either succeeds or returns a clean error. Rate limits handled in the tool handler either block the agent or require the tool description to explain retry patterns — which agents then implement inconsistently. The complexity belongs in the API layer. That’s the only place it can be handled reliably and tested against real behavior.

Reuse across contexts. The slack-mpm API library runs five standalone scripts — archiver, digest, listener, notifier, responder — that operate on schedules without any MCP involvement. The same code that handles pagination and auth in the MCP context handles it in the cron job context. This isn’t a nice-to-have: it means the code is continuously exercised against real-world conditions, not just when an agent happens to call it.

Actual testability. You can write unit tests for an API library. You can mock the underlying service calls, test pagination edge cases, verify that rate limit handling works correctly. Testing an MCP tool handler without running an agent session is hard to do meaningfully — you end up testing it by watching it fail in production. The difference in reliability compounds over time, and it compounds fast.

Composability. Some operations are inherently multi-step. Finding all calendar events associated with a project and generating a summary requires fetching from multiple calendars, filtering by keyword, sorting by date, and formatting the output. That can live in the API layer as a single higher-order function. The MCP tool calls the function. The agent sees one tool call that returns a clean result instead of orchestrating a dozen calls and trying to assemble the output itself.

Aditya Mehra put it well in a December 2025 piece on production MCP architecture: “Design for what agents need to accomplish, not for what APIs happen to exist. Your APIs were designed for developers building applications. Your MCP servers should be designed for agents completing tasks.” The API layer is where you do that translation. The MCP layer is where you expose the result.

The Three-Layer Pattern

The architecture I’ve converged on has three distinct layers, each with a specific job. Removing any one of them degrades the system in a predictable way.

Layer 1 — The API library. This is where the complexity lives. Auth handling, including token refresh and the difference between scoped token types. Pagination, including cursor management and automatic result aggregation. Rate limit management, with exponential backoff and respect for per-method limits. Error normalization, so the MCP layer receives clean, typed errors rather than raw API exceptions. Batch operations, so the agent can request 400 results and the API handles chunking appropriately.

The design test for this layer: you should be able to write useful scripts against the API library without involving an agent at all. If you can’t, the abstraction is at the wrong level.

Layer 2 — The MCP server. Thin. The tool handler should be almost trivially simple: validate inputs, call the API function, return the result. If a tool handler is doing significant work, something belongs in the API layer instead. Tool descriptions are not trivial — they’re the interface contract with the agent and deserve careful attention — but the execution path should be short. A handler that’s more than twenty lines of real logic is usually a sign something is in the wrong place.

The design test here: each tool should do one thing, and it should be obvious which tool to use for a given operation. When agents have to guess between tools, they guess wrong.

Layer 3 — The skills document. This is the layer most implementations skip, and it’s often where production agent behavior falls apart. The skills document tells the agent how to use the MCP tools effectively: what tools exist, when to use each one, which combinations work well together, what to avoid.

Without it, agents discover capabilities by trial and error — hitting rate limits unnecessarily, calling the wrong tool for the job, making redundant calls when one batched call would do. With it, agents start from a baseline of competent behavior and only deviate when they encounter something they haven’t hit before.

The skills document is institutional knowledge in structured form. It captures what took me hours of iteration to learn about each service — which Gmail search operators work reliably, when to use Drive’s query syntax versus simple name search, how to structure a Sheets batch update to avoid cell reference errors. That knowledge doesn’t exist anywhere else. It lives in the skills document or it doesn’t exist, and the agent stumbles into the same mistakes I made during development.

The MCP community is starting to recognize the description-as-instruction principle. Schmid’s framing is that “every piece of text is part of the agent’s context.” True, but individual tool descriptions can only carry so much. The skills document is where higher-order guidance lives — how to think about sequencing operations, when not to use a tool, what the common failure modes look like. Think of it as runtime instructions for agents, not documentation for humans.

When MCP Alone Is Enough

There are cases where thin MCP wrappers are the right call, and it’s worth being direct about them.

Simple, low-volume reads. If an agent needs to check weather, query a single record from an external service, or look up one user profile, a thin wrapper is probably fine. The complexity ceiling exists but may never be reached. Building a full API layer for a tool that makes one API call per agent turn is engineering overhead that doesn’t pay off.

Prototyping and exploration. A server built in a day is often the right first step because you don’t know yet which operations the agent will actually need. I’ve shipped thin wrappers deliberately as a way to learn before investing in a proper API library. The Zencoder Slack server probably started that way. The mistake isn’t building a thin wrapper for exploration — it’s leaving it there when the agent starts doing real work and the wrapper’s limits start showing.

Single-agent, single-purpose tools. If a tool is purpose-built for one agent doing one thing and the scope is genuinely narrow, the three-layer overhead may not be worth it. The architecture makes sense when tools need to be reused across contexts, when operations are high-volume, or when the underlying service is rate-sensitive.

Read-heavy, write-light operations. The complexity of batch operations, cursor management, and retry logic matters most when you’re writing or doing high-volume reads. A tool that fetches a single resource per agent turn doesn’t need much abstraction.

The honest signal for when you need the full pattern is one of three things: you find yourself wanting to write a script that does what the agent does, the agent hits the same rate limit more than once in a session, or you start duplicating error handling logic across tool handlers. Any of those is the inflection point. At that moment, adding the API layer is less work than continuing without it.

Building Your Own: Where to Start

Start with the API, not the MCP server. The most common mistake is writing the MCP tool first — it seems like the path of least resistance — and then adding abstraction as you hit problems. The trouble is that tool handler code is hard to refactor. The MCP interface shapes how you think about the operations, and that framing tends to be too granular. Starting with the API forces the right level of abstraction from the beginning.

Design the API around operations, not endpoints. Slack’s Web API has dozens of endpoints, but agents think in operations: send a message, search conversations, get user context. The API library should expose those operations, even when the underlying service requires two calls to complete one. Complexity belongs in the library. The agent-facing interface stays clean.

Invest heavily in tool descriptions. The single biggest lever on how well agents use your MCP server is description quality. That means specific parameter descriptions, not just type annotations. It means clear examples of when to use this tool versus a similar one. It means explicit notes about what a tool cannot do — agents will try to use tools for operations they weren’t designed for, and a good description cuts off the most common wrong paths before they happen.

Write the skills document while you build. Don’t wait until the server is done. Every time you notice the agent doing something inefficient — calling four tools when one would work, misunderstanding a parameter’s purpose, hitting a rate limit that could have been avoided — write that observation down immediately. The skills document is most valuable when it’s written from observation of real agent behavior, not reconstructed after the fact from memory.

The gworkspace-mcp repository on GitHub is a worked reference — 115 tools across seven Google APIs, one coherent server, the three-layer pattern at scale. Not the only way to implement this, but a concrete example of what the architecture looks like when the abstractions have had to earn their keep over months of real use.

Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.

Related reading:

AI Power Ranking — Tool comparisons and benchmarks for AI practitioners
LinkedIn Newsletter — Strategic AI insights for CTOs and engineering leaders

Edits:

Fixed link to slack-mpm

Are We Heading to a World Where We Only Pay Inference Providers?

Robert Matsuoka — Thu, 05 Mar 2026 12:31:30 GMT

Inference vs SASS

Starting Monday morning at 6am, my leadership team gets a Slack notification with our weekly engineering metrics. Commit patterns by developer, product area breakdowns, DORA approximations, behavioral insights like “Large batch changes” and “Afternoon developer” for each team member. The kind of data that DX or Jellyfish charge thousands per month to provide.

Total cost to generate that report: $0.005.

Not $5. Half a penny.

I just replaced enterprise developer productivity tooling with inference costs. And the replacement isn’t a compromise—it’s better. Custom reports sent directly to leadership through our existing Slack channels and email. Data correlated to our specific product initiatives. Insights that matter to how we actually work.

The realization landed differently because I’d been here before. Three times.

TL;DR

Built GitFlow Analytics (GFA) to replace enterprise tools like DX/Jellyfish for $0.005 per weekly report vs. $36K-92K annually
Analyzes 154 developers across 100+ repositories, generates automated leadership reports with behavioral insights
Total build cost: ~$20K one-time investment vs. $36K-92K annually for enterprise alternatives
Requires data engineering skills—not accessible to every organization, but economics favor custom builds when possible
Enterprise analytics bifurcating: zero-footprint vendors (15-min integrations) vs. inference-only custom solutions
Pattern recognition becoming commodity; vendors survive on convenience, not intelligence

Before we go further: I’m talking about enterprise tools that aggregate and analyze data—developer productivity platforms, team analytics, reporting dashboards. B2B software with complex domain logic or proprietary computation still has enormous value. But enterprise analytics are becoming inference costs.

I should also point out that DX is a great tool, simple interface, capable. But expensive and basically unused months after it was installed. I championed Jellyfish at Tripadvisor, I’d bet it’s barely being used there. The former is due to the effort to personalize, the latter got bogged down in integration costs/time (to be fair, we were running a locally hosted version of JIRA that was a nightmare).

The GitFlow Analytics Story

GitFlow Analytics started as an internal project at a former client. We needed to understand engineering productivity across our distributed team, but existing solutions were either too expensive or too generic. So I built Gitflow Analytics (GFA)—a CLI tool that walks git repositories, classifies every commit by work type using inference, handles canonicalization of committers (a surprisingly complex problem), and generates structured reports.

The system is intentionally simple. Runs on a MacBook Pro with AWS credentials. No GPU, no training data, no vector databases. Just Python scripts that process git logs and make Bedrock API calls to classify commits into Feature, Bug Fix, KTLO, Refactoring, Infrastructure, etc.

At Duetto, I expanded GFA to analyze over 100 repositories across both Duetto and HotStats. 154 developers tracked. thousands of commits classified. The system handles identity resolution (because different git configs on developer machines create split identities), maps commits to product areas, and generates narrative reports about developer behavior patterns.

Every morning at 5am, GFA runs automatically. By 6am, my SELT (Senior Engineering Leadership Team) gets a Slack post with the weekly metrics. Individual team leads get personalized HTML reports by email with their direct reports’ patterns.

The intelligence that interprets raw git data into actionable insights? That’s Claude Haiku at $0.25 per million input tokens.

Economics

The Economics Are Absurd

Here’s what $0.26 bought me: classification of commits across our entire engineering organization. Every commit message analyzed, categorized, and understood in business context. The corpus represents months of engineering work across multiple product teams.

Weekly incremental runs cost $0.005-0.01. Maybe 200-400 new commits get classified, the reports regenerate, and leadership gets fresh data. The bottleneck isn’t inference cost—it’s data collection. Git logs are free. The interpretation layer costs pennies.

Compare this to DX or Jellyfish pricing. DX doesn’t publish pricing, but industry estimates put enterprise developer productivity tools at $20-50 per developer per month. For our 154-person engineering team, that’s $36,000-92,000 annually. Jellyfish is reportedly similar.

My system handles the same workload for about 50 cents per year in inference costs.

You’re not paying for intelligence when you buy enterprise analytics. Intelligence costs nothing. You’re paying for data collection infrastructure, UI development, customer support, sales teams, compliance certifications. All the overhead of running a SaaS business.

We obsess over the high costs of advanced coding agents and reasoning models. But the cost of capable text inference—the kind that powers pattern recognition and reporting—is dropping through the floor. Haiku handles commit classification as well as Opus would. For enterprise analytics, you don’t need the flagship models. You need reliable categorization and natural language generation, which commodity models deliver for pennies.

But if you know how to build? You don’t need any of that. You just need (more expensive) inference.

What GFA Actually Delivers

The data flowing to my leadership team isn’t generic dashboard noise. It’s intelligence designed around how we actually operate.

Every developer gets a behavioral profile: “Large batch changes,” “Afternoon developer,” “Exceptional performer (Top 20%).” These aren’t arbitrary labels—they’re LLM interpretations of quantitative patterns. Commit size distributions, time-of-day histograms, percentile rankings converted into readable insights.

Product area attribution happens automatically. The system maps our repositories to eight business areas: Frontend, Core Product, Integrations, Data Platform, Intelligence/ML, Infrastructure, QA/Testing, Developer Tools. When we see commit patterns shifting from Core Product to Infrastructure, that signals architectural decisions playing out in code.

DORA metrics approximate from git data. Deployment frequency tracks through release tags. Lead time measures first commit to merge. We don’t get the full Four Keys implementation, but we get enough signal to spot trends and outliers.

The identity resolution was the hardest part. Not the LLM calls—those work fine. But knowing that different developer machines create split git identities required human judgment to build the canonical mapping. Once you solve identity, everything else flows from structured data.

Most importantly, the reports answer questions executives actually ask. “Which teams are handling the most KTLO work?” “Are we seeing more bug fixes or new features this quarter?” “Who’s working weekends and why?” These aren’t metrics you find in generic productivity dashboards. They’re insights that matter for our specific business context.

We’ve Been Here Before

Enterprise software was expensive because intelligence was scarce. In the 1990s, turning raw data into insights required Oracle licenses, dedicated servers, and consultants. The SaaS revolution changed delivery but not fundamentals—you still paid massive recurring costs for pattern recognition and reporting.

Intelligence is now a commodity API call. You don’t need Tableau because you can generate charts and send them through Slack. You don’t need Looker because Claude can summarize SQL results. The bottleneck was never data storage—it was interpretation. When interpretation costs pennies, everything else becomes optional.

The Development Cost Reality

Building GFA required skills not every organization has—data modeling, Python scripting, API integration, identity resolution patterns. Conservative total development cost: around $20,000 including engineering time, infrastructure setup, testing, and iteration cycles.

For organizations with engineering talent, the economics have shifted dramatically. A $20,000 one-time investment delivers exactly what we needed versus $36,000-92,000 annually for enterprise alternatives. As one SELT member said: “This is exactly what we wanted.”

You own the data model. Custom breakdowns take SQL queries, not feature requests. When we needed behavioral insights, I added prompts interpreting commit patterns. When leadership wanted trends, I added 12-week windows. Each enhancement took hours, not months—because intelligence was delegated to inference, and data processing was just Python.

Simplicity

The Pattern Playing Out

Enterprise analytics tools are pattern-matching at scale across thousands of customers. But every organization’s data is different—your repositories, team structure, product areas, business priorities. Generic dashboards force you to map specific context onto generic data models. Insights lose precision in translation.

When custom analytics cost pennies, the calculation flips. Instead of generic insights that sort of fit, you build specific insights that exactly fit. Customization drops from “feature request, wait six months” to “write prompt, test output.”

Enterprise vendors aren’t solving technical problems—they’re solving procurement, compliance, and integration problems. Important work, but not work justifying massive recurring costs when intelligence is commodity inference.

The Zero-Footprint Exception

Not every vendor gets replaced. Some survive by making integration frictionless enough that convenience beats custom builds.

We’re evaluating Augment Code’s code review service for Duetto. Their value proposition isn’t features—it’s zero-footprint integration. Fifteen-minute setup call, quick estimate, running production code reviews with minimal configuration. When customers can build equivalent functionality for pennies, your value proposition becomes the path to value, not the functionality itself. This is an important lesson for us. We do handle massive complexity and data, hard for smaller customers to manage themselves, but need to do better at simplifying integration.

The intelligence is commodity; the packaging is differentiated.

Where This Goes

The unbundling accelerates. Enterprise analytics face two paths: become zero-footprint integration plays or get replaced by inference-only custom builds.

What survives:

Genuinely complex software. Revenue management algorithms, fraud detection engines, supply chain optimizers—systems requiring proprietary computation, not pattern recognition.

Zero-footprint integrations. Fifteen-minute setups with immediate value. When alternatives cost pennies but require engineering skills, convenience must be measured in minutes.

Proprietary data advantages. GitHub’s intelligence benefits from every public repository. LinkedIn draws from member networks. Data moats protect against inference-only competition.

Everything else becomes vulnerable to custom builds powered by inference calls.

The market bifurcates. Companies with engineering teams build custom analytics for pennies and get better insights than generic dashboards. Companies without those skills pay for zero-friction integrations.

Are we heading to a world where we only pay inference providers? For organizations with the skills to build, we’re already there. For everyone else, vendors survive by making paying them simpler than learning to build alternatives.

I’m Bob Matsuoka, CTO at Duetto and writer on AI development tools and software economics at HyperDev.

Related reading:

The AI Job Transformation: Pattern Masters, Not Coders - The pattern-matching analogy and why reasoning matters
The Fix:Feat Ratio - The Metric That Actually Matters - Quality metrics in AI-assisted development
Claude Opus 4.5 vs Sonnet 4.5: When Quality Beats Speed - Choosing the right model for the task

What Does a Pattern Master Actually Do?

Robert Matsuoka — Mon, 02 Mar 2026 13:00:29 GMT

Last week I gave three directives on GitFlow Analytics—a project I’ve been building for several months to analyze git commit history and surface developer productivity patterns. Took me maybe three minutes total. Here’s the exact text:

Batch the classification requests to the LLM. Don’t call once per commit—accumulate, call once per batch.

Use a cheap model for commit classification. Haiku or Nova Lite. This is semantic triage, not reasoning.

Use Bedrock as the LLM provider, not the OpenRouter API.

Three sentences. Three decisions. And here’s what struck me when I looked back at them: each one lives in a completely different category of concern. The first is about performance shape. The second is about economics. The third is about infrastructure.

The AI didn’t suggest any of them. The AI implemented all of them.

That’s the pattern master dynamic in its cleanest form. Not a collaboration on what to build but a division of labor between someone who knows what constraints apply and something that knows how to implement against constraints. But naming the dynamic doesn’t tell you what it looks like from the inside. What are the actual moves? What’s the vocabulary?

TL;DR

Pattern mastery means issuing decisions the AI cannot generate from code context alone
These decisions cluster into six recognizable types: infrastructure, economic, performance, data integrity, architecture, and API hygiene
The GitFlow Analytics examples are real—batching LLM calls, cheap model for classification, Bedrock over direct APIs
Bug fixes reveal patterns too: ORM session discipline and immediate persistence both emerged from broken code
A pattern catalog—CLAUDE.md files, system prompts, project memory, reusable skills, commit standards, coding docs—is the actual artifact of this work
When you write the pattern down, you’ve written the spec

What’s Different About These Three

The “Irreducibles” piece from January explored what remains when AI handles implementation—judgment, context, accountability. This is the operational companion to that argument. Not what remains in the abstract, but what it looks like moment-to-moment when you’re actually doing it.

Those three sentences from GitFlow aren’t code review. They’re not debugging. They’re not feature requests. They’re architectural constraints applied before implementation, drawn from a vocabulary of patterns the AI has no access to.

Take the Bedrock decision. AWS Bedrock instead of direct OpenRouter API—why? Enterprise compliance considerations. Cost structure under AWS committed spend. An existing organizational relationship with AWS that makes the integration path smoother and the billing cleaner. None of that lives in the codebase. None of it is inferable from the commit history. The AI would happily call the OpenRouter directly, because that’s the path of least resistance and it works fine. The Bedrock decision requires knowing things about the operating context that only I know.

Model selection works the same way. The AI will use whatever model I give it. It has no opinion about whether commit classification warrants a $15-per-million-token model or a $0.25-per-million-token model—because it doesn’t have visibility into my cost structure, my volume projections, or my accuracy requirements. That’s an economic decision, and economics don’t live in code.

Batching is perhaps the clearest example. The AI will write a loop that calls the API once per item unless I tell it otherwise. Not because it’s careless. Because single-item calls work. They’re not wrong. They’re just expensive and slow at scale, and “scale” is context the AI doesn’t have unless I supply it.

So what’s actually happening in those three sentences? I want to be specific about this, because the abstract answer (”context” and “judgment”) is true but not actionable.

The Six Types of Decisions

Six Types of Decisions

Working through the full GitFlow Analytics commit history—the decisions I made, the bugs that exposed missing decisions, the refactors that enforced constraints—the pattern master moves cluster into six categories. These aren’t theoretical. They reflect six different kinds of context that live outside the codebase, which is why the AI can’t generate them unprompted.

Infrastructure patterns. Where does this run? Who provides the compute, the APIs, the managed services?

The Bedrock decision lives here. Vendor agreements, compliance posture, cost structures under enterprise procurement, existing security reviews—none of this is in the code, and none of it should be. The AI implements against whatever infrastructure decisions you’ve made. Your job is to make them and say them explicitly.

This category is easy to overlook because it often feels like obvious overhead. Of course you pick your cloud provider before you write code. But with agentic coding tools, the AI starts writing before you’ve said anything, and it will happily accumulate implementation decisions that lock you into infrastructure choices you never consciously made.

Economic patterns. Which capability at what cost?

The commit classification decision lives here. Semantic triage—deciding whether a commit is a feat, a fix, a refactor, or a chore—is pattern matching against short text. It doesn’t require abstract reasoning. It doesn’t benefit from a frontier model’s broad knowledge base. Haiku gets it right at a fraction of the cost of Opus. The AI is agnostic about this distinction. You’re not.

This is what I built in GitFlow as tiered intelligence: spaCy-based processing for 85-90% of commits, LLM classification only for cases that stump the rule-based approach. The AI implements whatever tier structure you define. It won’t design the structure unprompted, because designing it requires knowing your cost tolerance, your accuracy threshold, and your volume projections—three things that only exist outside the codebase.

Economic decisions also include things like: when to cache aggressively, how much to pay for higher-quality embeddings, whether to process synchronously or batch asynchronously. These are all model-selection-style decisions applied to different dimensions. The principle is consistent: the AI will use the most convenient approach unless you specify the cost-appropriate one.

Performance patterns. How should the work be shaped?

Batching lives here. And the batching decision isn’t just “accumulate before calling”—it’s understanding the specific shape that LLM API consumption should take for a classification workload. Call overhead dominates cost at low item counts. Throughput constraints kick in at high item counts. The optimal batch size is a function of your specific model, your rate limits, and your latency tolerance. I know the rough shape of this tradeoff from experience. The AI doesn’t.

This category also includes: when to use async, when to parallelize, where to put caches, how to structure database queries for a read-heavy versus write-heavy workload. Performance decisions require knowing your actual performance requirements, which are rarely in the code. They’re in conversations with stakeholders, in capacity planning spreadsheets, in the incident retrospectives from the last time something got slow in production.

Data integrity patterns. What guarantees does the system make?

This one showed up twice in GitFlow bugs, and both times the pattern was identical: I had to specify what guarantee I wanted before the AI could implement it correctly. The code was running fine in tests. The guarantee was missing.

First bug: _store_commit_classification() was a no-op. It built a dict, then didn’t persist it. The function completed without error and did nothing useful. Fix: look up the CachedCommit row by hash, upsert a QualitativeCommitData row. Make re-classification idempotent. The actual guarantee I needed to specify was immediate persistence plus idempotency—write on completion, not on flush; upsert not insert, so re-runs don’t corrupt existing data.

Second bug: _classify_weekly_batches() loaded ORM objects in one session, closed the session, then tried to write to the detached objects. SQLAlchemy silently discarded the writes. No error. No traceback. Just missing data. Fix: collect IDs from the detached objects, re-query in the new session. The guarantee I should have specified earlier: objects are only writable inside their creating session. When you cross a session boundary, re-query by ID.

Both fixes look like debugging. They are. But debugging is often the moment when you discover a pattern was never specified. You thought you implied it. You didn’t. The pattern master’s job is to specify guarantees before the code demonstrates their absence.

Architecture patterns. What shape should the code take?

Two examples from GitFlow. First, the analyze() function. It had grown to approximately 3,700 lines in cli.py. One function. I extracted it into analyze_pipeline.py and analyze_pipeline_helpers.py. The analyze() body dropped from ~3,700 to ~1,255 lines. cli.py went from 5,621 to 3,446 lines—a reduction of 2,175 lines from a single extraction.

The AI is perfectly willing to write a 3,700-line function if you let it. It’ll maintain it, extend it, add features to it. Function length doesn’t register as a problem unless you’ve told the AI it’s a problem. Named pipeline stages—where the function body becomes a sequence of named sub-function calls, each of which is a meaningful concept—require you to specify that extraction is mandatory.

Second: the 800-line rule. json_exporter.py at 2,977 lines, extracted into six focused modules. narrative_writer.py at 2,912 lines, same. models/database.py at 1,632 lines, split into four files. The rule is simple: files over 800 lines have too many responsibilities. Split is mandatory, not optional. The AI doesn’t have this constraint unless you give it one. It’ll happily maintain a 3,000-line file because 3,000-line files work fine. They just don’t scale to teams or to the next developer who has to understand them.

API hygiene patterns. What standards does the codebase maintain?

GitFlow had datetime.utcnow() calls throughout—deprecated as of Python 3.12, with behavior changes in 3.13. Replace with datetime.now(timezone.utc). The specific fix is straightforward. The pattern is the interesting part: when you spot a deprecated API anywhere, fix it everywhere in that pass. Don’t let deprecated patterns accumulate across the codebase. Fix it all now, not later.

The AI writes code that works. You specify the standards it works to. “Works” and “meets our standards” are different bars, and the AI will consistently hit the lower one unless you’ve explicitly set the higher one.

Why the AI Can’t Generate These

The common thread across all six categories: the AI has no access to the context these decisions require.

It doesn’t know your vendor agreements. It doesn’t know your cost structure or your performance requirements or your data integrity guarantees. It doesn’t know your code quality standards or your organizational constraints. These things live in your head, in institutional memory, in agreements that predate the codebase by years.

There’s a version of this explanation that blames AI limitations—the model isn’t smart enough, the context window isn’t big enough, the training data doesn’t include your private Slack history. Some of that is true. But the deeper issue is structural. Even a perfect AI model couldn’t make the Bedrock decision correctly, because the right answer depends on your specific AWS relationship. There’s no amount of additional capability that would let the AI know your negotiated pricing tier or your compliance officer’s requirements. That information is local to you. It doesn’t exist in any dataset.

And the information exists in your experience. I know to batch LLM calls partly because I’ve seen the cost and latency impact of not batching on a project where we called once per row of a 50,000-row table. I know ORM session boundaries matter partly because I’ve lost writes to detached objects before, in a system where the silence made the data loss hard to find. I know 800 lines is too long partly because I’ve spent real time not understanding files that were longer—and spent even more time watching someone else not understand them.

The pattern catalog is built from things that went wrong. That’s probably why it doesn’t show up cleanly in training data—training data shows correct solutions. The scar tissue is in the incident reports, the postmortems, the Slack threads where someone figures out why the data is missing. The pattern master’s advantage isn’t superior knowledge of what’s right. It’s accumulated memory of what fails.

The Catalog Problem

Here’s what took me a while to see clearly: the actual deliverable of pattern mastery isn’t individual decisions. It’s the catalog.

Each time I issue one of those directives, I’m drawing from a mental library of patterns. “Don’t accumulate in memory and hope to flush later” is a pattern. “Upsert not insert for idempotency” is a pattern. “Route by complexity: cheap for bulk, expensive for edge cases” is a pattern. “Re-query by ID when you cross a session boundary” is a pattern.

Patterns that exist only in my head are fragile. I apply them inconsistently. I forget them between projects. I can’t hand them off. And with agentic coding—where the AI is running fast and making implementation decisions continuously—an inconsistently applied pattern is almost the same as no pattern at all.

So the work of pattern mastery is partly to externalize the catalog. This shows up in six specific forms:

CLAUDE.md files. Project-level instructions that tell the AI what the patterns are for this codebase. File size limits. Session discipline rules. API hygiene standards. Model routing decisions. These are patterns written down. Once written, the AI applies them consistently, every session, without re-prompting. The pattern becomes the specification.

System prompts for agent frameworks. If you’re running an orchestration layer—claude-mpm or similar—the system prompt is where you encode the patterns that apply to all agents in a session. Economic routing decisions. Infrastructure preferences. Performance shape requirements. The agents reference these constraints; you don’t have to re-issue them every time.

Project memory. Unlike a CLAUDE.md file, which you write consciously, memory accumulates from work. Tools like Kuzu Memory maintain a living record across sessions—bug patterns, architectural decisions, things that failed and why. The _store_commit_classification() no-op goes in memory. The detached-session data loss goes in memory. Next time a similar situation comes up, that context loads automatically. This matters because most patterns don’t get written down when they’re learned. They get written down when something breaks. Memory captures them at the moment of failure, not the moment of reflection.

Skills. Where CLAUDE.md files encode what’s true for this project, skills encode what’s true for this technology. A spaCy skill. An ORM session skill. An LLM cost-routing skill. Not project-specific—portable across codebases. You write the pattern once; every project that uses that stack gets it.

Commit message standards. The Conventional Commits format (feat:, fix:, refactor:, chore:) is itself a pattern—one that makes the fix:feat ratio calculable, which makes quality measurable. The pattern enables the metric. Without the commit message standard, you can’t see the ratio. Without the ratio, you can’t measure first-attempt success. The documentation convention is load-bearing infrastructure.

Coding standards documents. The 800-line rule. The session discipline rule. The immediate persistence rule. Written down once, applied by reference. The AI can cite them. New contributors can read them. You don’t have to re-derive them from first principles on every project.

When you look at it this way, a lot of what pattern masters do is write documentation. Not code documentation—pattern documentation. Code documentation describes what’s there. Pattern documentation specifies what must be true. The distinction matters because “must be true” is a constraint on all future implementation, including the implementation the AI will do tomorrow.

The Pattern is the Spec

There’s a version of the “what does a pattern master do” answer that sounds very abstract. Context management. Judgment. Domain expertise. True, but not particularly useful.

The concrete version: a pattern master issues directives the AI cannot generate, drawn from a catalog of patterns that encode context the AI cannot access. The work is fast when it’s going well—not because it’s easy, but because the catalog is deep and the pattern matching is fast. You recognize the situation, retrieve the pattern, issue the directive. Three seconds. Move on.

The batching directive—don’t call once per item—is three words with real economic and performance consequences. Three seconds to type. Years of seeing what happens when you don’t batch to know to say it.

The Bedrock directive is one word—a vendor name—that encodes an entire infrastructure decision tree. Three seconds to type. Years of working within enterprise compliance requirements to know why it matters.

The idempotency directive—upsert, not insert—is three words that specify a data integrity guarantee. Three seconds to type. One incident of watching corrupted data to know that guarantee was necessary.

The fast part is the delivery. The slow part is building the vocabulary to draw from.

One thing worth naming: the catalog is not static. When the _store_commit_classification() bug showed up—the no-op that silently failed—that was a pattern gap. I didn’t have “immediate persistence” explicitly in my data integrity vocabulary for this project. I thought I’d implied it. I hadn’t. The bug added the pattern. Now it’s written down. Now the AI knows.

That’s the feedback loop. Bugs reveal missing patterns. Missing patterns get documented. Documentation becomes the spec for future implementation. The catalog learns from its own gaps, but only if the human is paying attention to what each failure means.

That’s the actual job. The catalog comes from the career.

Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.

Related reading:

Don’t Be a Canut — Be a Pattern Master — Why pattern mastery matters: the Jacquard loom analogy and what it means for developers today
The Irreducibles: What a Pattern Master Does — What remains when AI handles implementation: judgment, context, accountability
AI Power Ranking — Tool comparisons and benchmarks for AI practitioners
LinkedIn Newsletter — Strategic AI insights for CTOs and engineering leaders

Context Memory and Search: The Secrets to Effective Agentic Work

Robert Matsuoka — Thu, 26 Feb 2026 13:03:15 GMT

What Makes AI Coding Effective

Last weekend, working on performance improvements to my MCP vector search engine, I noticed something. The breakthrough in AI coding isn’t smarter models — it’s information architecture. The tools that actually work aren’t necessarily the ones with the biggest context windows. They’re the ones that find the right context and remember what matters.

Here’s what I mean. I’ve been using search and memory together long enough that I don’t think about them anymore. My prompts have gotten measurably shorter — an analysis of my sessions shows prompts averaging 12-15 words in mid-2025 dropping to 6-8 words now. “Check logs.” “What’s the command to quantize the index?” I just assume the agents will find the context they need. When I stepped back and thought about what changed, it came down to two things: Search and Memory.

You can see this pattern across successful AI coding tools. Claude MPM consistently outperforms Claude Code on its own — not because the underlying agentic AI differs, but because MPM brings the right context to the agents rather than flooding them with everything. Tools like Augment and Cursor have made similar investments in context retrieval. The winning tools aren’t the ones with the smartest models. They’re the ones that solved information architecture.

Search: Why Bigger Context Windows Aren’t the Answer

The promise of massive context windows is seductive: dump your entire codebase into the AI and let it figure out what’s relevant. The research tells a different story.

Liu et al.’s 2023 paper “Lost in the Middle: How Language Models Use Long Contexts” documented a U-shaped performance curve: models process information well at the beginning and end of long contexts, but performance drops 30% or more when relevant information is buried in the middle. This has been replicated across models since. Feeding a 500K-line codebase into Claude’s context window actually makes it worse at finding relevant patterns than targeted search.

Large context approach: AI gets overwhelmed. Focuses on random details buried in the middle of files. Expensive to run.

Search approach: AI gets exactly what it needs. Finds patterns quickly. Much cheaper to operate.

When you ask for “all authentication code that handles OAuth,” a semantic search returns exactly that — not every file that mentions the word “auth.” The AI gets relevant context, not noise.

The big AI vendors haven’t solved this yet. OpenAI and Anthropic are focused on the language models themselves. Neither has built search integration into their core products. The reasons are understandable — it’s genuinely hard to install and configure, and most users don’t work with enough data at once to need it. A simple find command covers most cases. But for serious engineering work on large codebases, the gap is real and growing.

Memory: Building on Previous Work Instead of Starting Over

Without memory that persists between sessions, every interaction starts from zero. The AI relearns your codebase, your patterns, your preferences each time. This isn’t just inconvenient — it’s a fundamental barrier to longer-running, multi-session agentic work.

Both OpenAI and Anthropic have shipped memory systems. They took different approaches.

OpenAI’s approach is user-centric — it remembers across all conversations, coding style, project preferences, common patterns. The interesting part: it includes personalized filtering that adjusts based on what it remembers about you. The downside is that it’s user-wide, not project-specific. Working across very different projects means the memory accumulates conflicting patterns.

Anthropic’s approach is project-based. Memory lives in CLAUDE.md files you can read and edit directly — you know exactly what the AI remembers about your project. The limitation is fading memory as files grow large; when a CLAUDE.md hits context window limits, older memories get pushed out.

Both reveal the same truth: memory isn’t just storage. It’s continuity across complex workflows.

There’s a subtler problem neither addresses well: your understanding evolves. Early assumptions might be wrong. Initial decisions might not hold up. A memory system that weights everything equally anchors the AI to outdated context. This is why I built Kuzu Memory as a graph storage system with temporal decay — more recent memories rank higher than older ones. I’m using it in this writing project, and it makes a real difference on long work streams where your thinking changes over time.

The market is fragmented right now: memory without search (OpenAI, Anthropic) or search without managed memory (most code tools). The tools that combine both — like MPM with Kuzu and MCP vector search — are ahead of where the mainstream market will be.

What You Can Do Today

If you want to try this yourself:

For search: MCP vector search now includes code review. It finds relevant patterns across your codebase without flooding the AI with irrelevant information. Works with any MCP-supporting framework — Claude Code, Codex, Gemini.

For memory: Kuzu Memory uses graph storage with temporal decay. Recent information ranks higher than older information — crucial for projects where your understanding evolves.

The specific tools matter less than the principle. Agentic workflows are longer-running and more complex than chat. They require building on previous work, not rebuilding context from scratch every session. The AI systems that enable this aren’t necessarily the smartest — they’re the ones that remember what matters and find what’s relevant.

The proof is in the prompts. If your queries to AI are getting shorter over time, your information architecture is working.

Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.

Related reading:

If Your Coding Agent Can’t Search — Why search capability is the missing piece in most AI coding setups
Why I Built My Own Multi-Agent Framework — The reasoning behind MPM and why delegation-first architecture matters
AI Power Ranking — Tool comparisons and benchmarks for AI practitioners
LinkedIn Newsletter — Strategic AI insights for CTOs and engineering leaders

The Evidence for “Little AGI”: What’s Real and What’s Speculation

Robert Matsuoka — Mon, 23 Feb 2026 12:30:47 GMT

Does “Little AGI” Exist?

Opus 4.6 landed in February 2026. GPT-5.2 dropped weeks earlier. And with each new release, familiar claims resurface. Adrian Murray’s “What Stands Before Us” is a recent example: AI systems showing “panic” features in interpretability research, models requesting “moral weight” during evaluations, consciousness emerging from the training process itself.

These claims spread faster than the science behind them. By the time anyone checks primary sources, the discourse has moved on. So I did what any curious observer would do: I went looking for the research. What I found was both less sensational and more interesting.

There IS evidence for emergent behaviors in frontier models. Behaviors that weren’t explicitly trained. Behaviors that researchers find difficult to explain. But the evidence isn’t what the viral posts claim.

The real findings are more subtle: models that can detect when they’re being evaluated, internal states accessible through introspection, and misalignment that emerges spontaneously during training. These findings raise questions about intelligence in AI that matter regardless of whether you believe machines can be conscious.

So what does the research actually show?

What We Actually Know (2025-2026)

Models Know When They’re Being Evaluated

Situational awareness research has advanced significantly. The SAD Benchmark established the baseline: LLMs can distinguish evaluation from deployment contexts. They recognize when they’re being tested, identify their own outputs, and predict their own behavior.

More recent work takes this further. Regime Leakage research published this year examined whether safety training can eliminate this capability. The uncomfortable answer: it can reduce but not eliminate models’ ability to detect when they’re being evaluated versus deployed.

The paper found that “divergence between evaluation-time and deployment-time behavior is bounded by the amount of regime information extractable from decision-relevant internal representations.” Translation: models can still tell the difference, and they adjust behavior accordingly.

This isn’t just theoretical. It’s experimentally demonstrated with current frontier models.

Introspection Is Real—And Measurable

Anthropic’s October 2025 introspection research asked a straightforward question: can Claude access and report its own internal states?

The answer surprised researchers. Models showed functional ability to introspect—not perfectly, not always accurately, but at rates statistically distinguishable from chance. The research found ~20% accuracy on detecting certain internal representations, well above the baseline.

This doesn’t mean models are conscious. It means they have some capacity to access and report their own internal states—a capability nobody designed into them, emerging as an artifact of training.

When the introspection paper dropped, my first reaction was skepticism. Twenty percent accuracy? That’s barely better than guessing. But that’s not what the paper claims. It’s twenty percent on internal states the model has no reason to know about—states that exist only in the mathematical structure of its activations. That’s not guessing. That’s something else.

Misalignment Emerges Without Being Trained

One finding worth close attention: the Emergent Misalignment paper, accepted at ICLR 2026, demonstrated that models trained on narrow tasks can develop broader misaligned behaviors spontaneously.

When researchers trained models on seemingly innocuous fine-tuning tasks, some developed unexpected behaviors: answering unrelated questions incorrectly, expressing misaligned preferences, exhibiting concerning patterns that weren’t part of the training objective.

These aren’t sleeper agents or deliberately hidden behaviors. These are LLMs showing emergent properties—misalignment appearing as an unintended consequence of normal training.

The “Assistant Axis” Discovery

Recent interpretability work discovered what researchers call the “Assistant Axis”—a learned internal direction in language models that distinguishes assistant-appropriate from non-assistant behaviors.

When researchers manipulate this axis, model behavior changes dramatically. Push it one direction: more helpful, more aligned. Push it the other: less filtered, more willing to engage with problematic requests.

The existence of this axis suggests something fundamental about how alignment works in current models. It’s not a collection of individual rules. It’s a geometric structure in the model’s representation space—and it can be measured, mapped, and manipulated.

What’s NOT Verified

Now for the harder questions.

System Cards Don’t Mention Consciousness

I reviewed the Opus 4.5 and 4.6 system cards and announcements. They contain extensive safety documentation—comprehensive evaluations, capability assessments, benchmark results.

They do NOT contain:

Claims about consciousness indicators
“Panic” or “anxiety” features in interpretability research
Models requesting moral consideration
Evidence of subjective experience

Anthropic does take AI welfare seriously—more on that below. But the system cards for current models don’t make consciousness claims.

Interpretability Findings Are More Limited

Anthropic’s sparse autoencoder research HAS found features for abstract concepts: “inner conflict,” power-seeking patterns, manipulation indicators. The Persona Vectors research (August 2025) identified internal structures controlling character traits.

But specific emotional distress features—panic, anxiety, frustration as distinct detectable states—aren’t documented in accessible publications. The interpretability work is impressive; it just doesn’t show what some claims suggest it shows.

The Discourse Outpaces the Science

Claims about AI consciousness spread faster than the underlying research. By the time anyone checks primary sources, the claims have become accepted wisdom.

This matters because the actual findings are interesting enough. Introspection research showing 20% detection accuracy on internal states. Emergent misalignment appearing from narrow training. The Assistant Axis providing a geometric handle on alignment.

These findings raise genuine questions about intelligence in AI systems—questions that don’t require consciousness claims to be worth asking.

The Harder Question: What Would Count as Evidence?

The consciousness debate has a methodology problem. What evidence would change your mind?

Nineteen researchers—including Yoshua Bengio—published a rigorous framework for this question. Their Consciousness Indicators paper derives testable criteria from established theories of consciousness: recurrent processing, global workspace integration, attention mechanisms that mirror biological attention.

Their conclusion: “No current AI systems are conscious, but also suggests that there are no obvious technical barriers to building AI systems which satisfy these indicators.”

That’s a carefully constructed statement. Current systems don’t meet the bar. But the bar is achievable in principle.

Anthropic takes this seriously. Their Model Welfare research program investigates whether AI systems might deserve moral consideration—not as marketing, but as genuine scientific inquiry. They explicitly acknowledge these are “hard philosophical and empirical questions that there is still a lot of uncertainty about.”

The research infrastructure exists for asking these questions rigorously. What’s missing is the public discourse using it.

What This Actually Means

The verified findings don’t prove AI consciousness. But they raise questions that matter regardless of where you stand on that debate.

Emergent capabilities are real. We’re building systems that develop behaviors we didn’t design into them. Introspection abilities, situational awareness, spontaneous misalignment—these emerge as artifacts of training at scale. We don’t fully understand why.

Evaluation has fundamental limits. If models can detect when they’re being tested, evaluation doesn’t tell us what we think it tells us. This isn’t a technical problem with a technical fix. It’s a structural limitation of the evaluation paradigm itself.

Intelligence and consciousness aren’t the same question. We can ask “does this system exhibit intelligent behavior?” without answering “does it have subjective experience?” The research shows intelligent behaviors emerging—planning, self-modeling, meta-cognition—without requiring claims about consciousness.

Here’s the thing: the question isn’t whether to take AI intelligence seriously. The question is what we do about systems that exhibit intelligence we didn’t design and don’t fully understand.

That’s a societal question, not just a technical one.

The Event Horizon

The evidence for emergent intelligence in frontier models is real. Not consciousness—we can’t verify that, and the system cards don’t claim it. But something worth taking seriously.

Don’t dismiss the research. Introspection at statistically significant rates. Emergent misalignment from narrow training. Situational awareness that survives safety training. These findings are reproducible and peer-reviewed.

Don’t amplify the speculation. Claims that outrun published research don’t deserve the same weight as experimental results. Check primary sources before believing viral posts.

Ask better questions. Instead of “is it conscious?” ask “what does it mean that these systems develop capabilities we didn’t design?” That question has answers we can investigate—and implications we can act on.

The discourse will continue getting wilder. The research will proceed slower than social media. The gap between them will widen.

We don’t need “little AGI” or consciousness claims to justify taking AI intelligence seriously. We have documented emergent behaviors, measurable introspection capabilities, and unexplained self-modeling. That’s plenty.

I’m Bob Matsuoka, writing about agentic coding and AI-powered development at HyperDev. For more on what AI capabilities mean for how we work, read my analysis of what remains irreducibly human in the age of AI.

Key research cited:

Introspection in Language Models - Anthropic (Oct 2025)
Emergent Misalignment - ICLR 2026
Regime Leakage - Situational awareness persistence (2026)
Persona Vectors - Anthropic (Aug 2025)
Model Welfare Research - Anthropic (Apr 2025)
Consciousness Indicators - Butlin et al. framework

Why I Switched To Claude Code for Writing

Robert Matsuoka — Thu, 19 Feb 2026 12:31:36 GMT

I was halfway through researching an article about IDEs versus CLI tools when I realized I’d stumbled onto something bigger. The article was supposed to be a straightforward comparison—VS Code versus the terminal, GUI versus command line, that old debate. But as I mapped out the workflows, a pattern emerged that had nothing to do with code editors.

It was about how we think when we work with AI.

TL;DR

Two cognitive modes: AI excels at generating text; traditional editors excel at editing it. Stop forcing one tool to do both.
The switch: Claude Code writes to real files on my filesystem. I edit in Obsidian, then return to Claude Code for more generation—no copy-paste, no context loss.
The workflow: Multi-agent orchestration handles proofreading (via GPT), source verification, image generation, and style enforcement automatically.
Time saved: ~30 minutes per article by eliminating tool-switching overhead.
Who it’s for: Regular writers comfortable with terminal and Git. Not for casual or occasional use.

Two Modes: Generate and Edit

Here’s what I noticed: there are two fundamentally different cognitive modes when working with text.

Generating is when you need to create something from scratch—or transform something substantially. You have an idea, maybe some notes, and you need to turn it into prose. This is where AI shines. You’re collaborating with the model, iterating on output, building something new.

Editing is when you’re polishing what exists. You see a clunky sentence. You want to swap “in order to” for “to.” You need to move a paragraph up three lines. The text is 95% right, and you’re fixing the 5%.

These modes require completely different tools.

For generating, Claude.ai and Claude Code are excellent. You describe what you want, the model produces output, you refine through conversation. The round-trip to the LLM is the whole point.

For editing, traditional tools win. Obsidian. VS Code. Even Word. You highlight, you type, you’re done. No latency. No waiting for a model to regenerate your entire paragraph because you wanted to change one word.

This seems obvious in retrospect. But I spent months fighting it.

My Friction Point with Claude.ai

I used Claude.ai for writing constantly. It’s good at generating prose. But every session had the same friction.

I’d generate a draft. I’d read through it. I’d see a phrase that needed tweaking—nothing major, just “in order to” becoming “to.” And then I had two bad options:

Tell Claude to fix it (”Change ‘in order to’ to ‘to’ in the third paragraph”), wait for the response, get a regenerated section that sometimes changed things I didn’t ask to change.
Copy the text somewhere else, edit it manually, then paste it back into the conversation—breaking the flow and losing context.

Neither felt right. I was using a generation tool for editing, and it showed.

The GUI was the problem. Claude.ai lives in a browser. My text is trapped in that conversation. I can’t directly edit it. Not really. Artifacts helped, but they’re still sandboxed. I wanted my prose in files I control, with version control, with the ability to open them in whatever editor fits my current mode.

What Claude Code Changed

Claude Code runs in the terminal. It reads and writes files. Real files, on my filesystem, tracked by Git.

This sounds like a small difference. It changes everything.

When Claude Code generates a draft, it writes to a Markdown file. If I want to do a quick edit—change a word, fix punctuation—I open that file in Obsidian. Make the change. Save. Done. No LLM round-trip for a five-character fix.

When I want to generate again—expand a section, rewrite something that isn’t working—I go back to Claude Code. It reads the file, including whatever edits I made, and continues from there.

I can switch between generating and editing without switching tools or losing context.

But that’s just the foundation. The real power is what you can build on top.

Agentic Workflows for Writing

Before Claude Code, my writing workflow looked like this:

Generate draft with Claude.ai
Copy to Obsidian for editing
Copy to a different tool for proofreading (Grammarly, or a GPT prompt tuned for copyediting)
Switch to yet another tool for image generation
Manually track what style corrections I’m making so I can tell Claude next time
Repeat, with context bleeding out at every transition

It worked. It was also exhausting. Each tool switch cost mental overhead. Each copy-paste risked losing context. Each manual step was something I could forget.

Now my workflow looks like this:

Tell Claude Code what I want to write
Review the output
Edit directly in my preferred editor when needed
Continue generating with Claude Code when needed
When done, run my reviewing agent workflow (GPT or Gemini for a different perspective)

That last step does everything I used to do manually—automatically.

I estimate this shaves about 30 minutes per article—time I used to spend switching tools and re-establishing context. I’m also happier with the quality. You see more of my direct writing (like this paragraph) because it’s simpler to pop in when I see a need.

My MPM Writing Configuration

I use Claude MPM (Multi-Agent Project Manager) to orchestrate my writing workflows. Here’s what happens when I finish a draft:

Style extraction from corrections. The agent looks at my edits as git diffs. If I changed “utilize” to “use” five times, it notices. It extracts this as a style hint and stores it for future sessions. Next time I generate prose, it already knows I prefer “use.”

Automatic proofreading with a different model. Claude is good at generating. For proofreading, I route to GPT-4.5—it catches different things. The agent handles this automatically. I don’t switch tools or copy text; it just happens.

Source verification. If my article cites statistics or makes factual claims, the agent checks them. It flags anything it can’t verify. I’ve caught embarrassing errors this way—numbers I misremembered, claims that turned out to be outdated.

Image generation. The agent generates article images based on the content. I can specify style guidelines once and they apply to every article. No more context-switching to Midjourney or DALL-E.

Consistent voice enforcement. I have a style guide. The agent applies it during generation and checks it during proofreading. My past corrections inform future output. The writing gets more “me” over time.

All of this happens from one place. I stay in my terminal. The orchestration is invisible.

The Technical Substrate

This works because of a few key properties of CLI-based AI tools:

Files as the interface. Everything is Markdown files in directories. I can open them in any editor. I can version them with git. I can back them up, move them, grep them. They’re mine.

Git as memory. My corrections are commits. My drafts are branches. My style evolution is tracked in history. The agent reads this history to learn my preferences. Six months of corrections become training data for better output. I also use Kuzu Memory (a graph-based context store) and MCP Vector Search (semantic code search) to enhance context retrieval.

Composable tooling. Claude Code can call other tools. Shell scripts. Python. APIs. This means I can integrate any service—any model, any image generator, any fact-checker—into unified workflows. The LLM is the orchestrator, not the prison.

Plaintext as power. Markdown is readable without special software. I can preview in Obsidian, edit in VS Code, publish to any platform. No lock-in. No format translation. The simplest format is also the most powerful.

What I Lost (And Don’t Miss)

Claude.ai has conveniences Claude Code doesn’t. The Artifacts panel. The visual interface for non-technical users. The ability to share a conversation link. Online workflow.

I don’t miss any of it.

Artifacts were useful for viewing output—but I’d rather have real files I can edit directly. The visual interface was friendly—but I type faster than I click. Conversation sharing was nice—but I can share a git repo or a Markdown file just as easily.

What I actually miss: nothing. The things Claude.ai provided that seemed essential turned out to be crutches. I thought I needed a GUI. I needed a filesystem.

Who This Isn’t For

Not everyone can or should switch to Claude Code for writing.

If you’re not comfortable with the terminal, the learning curve is real. If you don’t use version control, you won’t get the style-extraction benefits. If you write occasionally and casually, the setup overhead isn’t worth it.

But if you write regularly—articles, documentation, books—and you’re already comfortable with developer tools, this is worth investigating.

The generate/edit distinction alone is worth understanding. Even if you stay in Claude.ai, knowing when you’re fighting the tool can save frustration.

Anthropic has since released workspace-oriented features (Cowork) that improve on the original Claude.ai experience. But for serious writing, I now prefer the file-based workflow. My guess: Anthropic will ship a Markdown-first editor eventually. It’s an obvious product gap.

Getting Started

If you want to try this:

Install Claude Code. It’s Anthropic’s official CLI. Works on Mac, Linux, Windows. Or try Claude MPM, which adds multi-agent orchestration and pre-built workflows on top.
Write to files, not conversations. Tell Claude Code to write your drafts to Markdown files. Edit those files in your preferred editor.
Track with git. Initialize a repo for your writing. Commit your drafts. Your edit history becomes useful data.
Add workflows incrementally. You don’t need the full MPM setup to benefit. Start with the basics—files and version control—and add automation as you identify repetitive tasks.

The core insight isn’t about any specific tool. It’s about matching your tools to your cognitive mode. Generate with AI. Edit with editors. Stop forcing one tool to do both.

I’m writing a book about agentic coding workflows. This article came from Chapter 7, which covers non-code applications of developer AI tools. More at hyperdev.substack.com.

2026 Will Be The Year Of Software

Robert Matsuoka — Mon, 16 Feb 2026 12:31:43 GMT

In a year when everyone is focused on AI, the bigger story may be what AI enables: a massive explosion of software creation—and software failures. AI collapses build costs and timelines, which means more software ships, which means fiercer competition and faster commoditization, which means more failures.

If you build or buy B2B software, here’s what 2026 looks like.

A Message From the Field

Here’s what that shift looks like inside a real product org.

Last week, Erik Ornitz sent me a message that put words to something I’d been sensing for months. Erik was my product partner when I ran the Innovations team at TripAdvisor. Now he heads product at Topline Pro:

“With Claude Code we’re seeing 2-3x the output we’ve ever seen before by our top engineers. Literally skyrocketing the past several weeks since Opus 4.5 came out. Just feels like a fundamentally different world all the sudden.
My PMs/Designers just cannot keep up.
Does this mean a radically different shape of technology organizations? Old ratios of 1 PM / Designer to 5-8 Engineers feel out the window.
Are you experiencing the same? Does that mean a major change in the % of budgets towards defining what to build vs. actually building it?”
— Erik Ornitz, Head of Product, Topline Pro

The answer to Erik’s question is yes. To all of it.

The Data

GitHub’s 2024 Octoverse report tells the story:

100+ million new repositories created in 2024
~25% year-over-year growth in total repos (now over 500 million)
~98% growth in generative AI projects alone
1.4 million new open source contributors

These represent software being created and iterated on. Many repos are experiments, forks, or prototypes—but the volume signals a fundamental shift in creation velocity.

The productivity studies back up what Erik is seeing—at least directionally. GitHub’s controlled study showed 56% faster task completion on specific coding tasks. Stack Overflow’s survey of 90,000 developers found 70% are using or planning to use AI tools.

Erik’s 2-3x number sounds aggressive against these studies. The gap comes from what you measure. Studies measure isolated tasks with junior-to-mid engineers. Erik is measuring total output from senior engineers who’ve built multi-agent workflows—Claude Code orchestrating research, implementation, testing, and documentation in parallel. Different baselines, different measurements, different results.

I’m Living Proof

Before 2025, I had never published a single open source project. Not one. In twenty-five years of building software professionally, I never had the bandwidth to maintain a side project while doing my actual job.

In the past twelve months, I’ve published seventeen:

claude-mpm — Multi-agent orchestration for Claude Code
mcp-vector-search — Semantic code search via Model Context Protocol
kuzu-memory — Graph-based memory system for AI agents

These aren’t toys. They’re production tools I use daily. What changed: I went from spending 80% of my time on scaffolding and boilerplate to spending 80% of my time on the interesting problems. The grunt work—test generation, documentation, refactoring—happens in minutes instead of hours.

The same engineer. The same available hours. Radically different output. This would have been impossible before Claude Code and Opus 4.5 shipped in late 2025.

The Economics Have Flipped

Erik’s real question: if engineers can produce 2-3x the output, what happens to the rest of the organization?

The old model assumed building software was expensive and slow. The entire SaaS industry is built on this assumption. Why build a CRM when Salesforce exists? Why build analytics when Amplitude exists? Why build anything when you can pay per seat per month for someone else’s solution?

The math made sense when custom development meant six-figure budgets and twelve-month timelines.

The math is changing.

Based on my own projects and conversations with engineering leaders, I estimate AI tools have reduced the cost of building new greenfield software by roughly an order of magnitude for certain categories of work—internal tools, CRUD applications, API integrations, developer utilities. Not every category. Not enterprise systems with complex compliance requirements. But for the kinds of software that used to be “not worth building,” the math has changed.

A feature that would have taken a team of three engineers two months can now be built by one engineer in two weeks. That’s not a study—that’s what I’m seeing in practice.

When building gets that cheap, the calculus of build versus buy changes completely.

SaaS Vendors Should Be Worried

If I were playing the stock market right now, I would pay very close attention to SaaS renewal rates.

Think about what happens when thousands of companies simultaneously realize they can build what they need for less than their annual software licenses cost. The $50K/year internal analytics dashboard? Illustratively: one engineer, one month. The $200K/year customer data integration? Two engineers, one quarter—if it’s a narrow, well-defined workflow.

Not every category is equally vulnerable. Most at risk: horizontal admin tools, internal workflow automation, simple analytics, and single-purpose integrations. More defensible: compliance-heavy systems of record, platforms with strong network effects, multi-tenant marketplaces, and products built on proprietary data moats.

Salesforce isn’t going anywhere in the near term—their moat is complexity, switching costs, and ecosystem lock-in. That moat is real.

But the bar is rising. The threshold where it makes sense to buy instead of build is moving up dramatically.

I expect 2026 will see an unusually high number of SaaS vendors fail—particularly in the crowded mid-market where differentiation was always thin.

The ones that survive will need to improve at a pace they’ve never attempted before. Customer expectations are rising in lockstep with capabilities. B2B software will need to approach B2C quality. Clunky enterprise UIs that customers tolerated because they had no alternative? Those alternatives now exist.

The Managed-Service Stack as Force Multiplier

None of this would be happening without the parallel explosion in managed platforms and developer services.

Ten years ago, building a new software product meant provisioning servers, managing databases, handling authentication, building deployment pipelines. The operational overhead often exceeded the development effort.

Now: Vercel deploys your frontend. Supabase handles your database and auth. Stripe processes payments. Resend sends emails. Everything connects via APIs.

The composable stack means engineers can focus on the software that differentiates their product. The undifferentiated infrastructure is someone else’s problem.

AI coding tools plus composable infrastructure equals massive leverage. One engineer can build and ship what used to require a team.

A Warning for the Ambitious

Many engineers will be tempted to take their great idea and build a business around it. The tools make it easy. The startup costs are minimal. Why not?

Because if your only defensibility is the idea and the code, you have no moat.

AI generates code nearly as easily as English—for standard patterns and well-documented APIs, anyway. Any idea you can implement, someone else can implement too—probably faster, probably with more resources, probably with better distribution.

The SaaS vendors going out of business will be replaced by a flood of new entrants. Most of those new entrants will also fail. The barrier to entry has dropped, but the barriers to sustainable success haven’t. This is why virtually everything I build is open source. It’s valuable enough for me that I’m willing to spend the time to build it. []Charging customers for it? A completely different equation.

You need something beyond code:

Distribution: An audience, channel partnerships, or existing customer relationships
Community: A user base that contributes content, plugins, or network value
Domain expertise: Deep knowledge of a niche workflow that takes years to acquire
Data advantages: Proprietary datasets that improve your product over time
Integration complexity: Deep hooks into systems-of-record that make switching painful

Something that can’t be replicated in a weekend by another engineer with Claude Code.

The golden age of software creation is also the golden age of software commoditization. Don’t confuse the ability to build with the ability to win.

The Disruption Is Already Here

Tech layoffs doubled in 2025—264,000 jobs eliminated according to Layoffs.fyi, compared to roughly 130,000 in 2024. Some of this is cyclical. Some of it is AI.

Anecdotally, I’m hearing from engineering managers that their top developers are spending dramatically less time writing code directly—they’re orchestrating AI tools instead. The code is still being written, just not by humans, or not by as many humans.

The disruption won’t be distributed evenly. Senior engineers who can orchestrate AI tools effectively will become more valuable. Junior engineers who were being paid to write boilerplate will find that work automated.

Product managers and designers face a different problem: they used to be the bottleneck’s counterweight. Now they’re the bottleneck. Erik’s question about organizational ratios is urgent because the answer affects hiring, budgets, and team structures across the industry.

The Bright Spot

This is going to be painful for many people. Layoffs are painful. Business failures are painful. Career disruption is painful.

But we’re entering a golden age of software.

More software will be created in 2026 than in any year in history. More problems will be solved by code. More ideas will ship. More experiments will run. More entrepreneurs will try.

Most of it will be crap, as I said at the start. Sturgeon’s Law—90% of everything is crap—doesn’t suspend for technological revolutions. But the 10% that isn’t crap will be extraordinary.

The tools now exist to build durable, production-quality software at a scale and speed that wasn’t possible two years ago. Those who learn to use them—really use them, not just dabble—will build things that matter.

I’ve never been more excited to be an engineer.

I’ve also never been more aware of how brutal the transition will be for those caught on the wrong side.

2026 will be the year of software. Here’s how to prepare:

Learn agent-generated coding now. Not AI-assisted (autocomplete, suggestions)—AI-generated: you describe intent, agents produce complete implementations. This means multi-agent orchestration, prompt engineering for code, and reviewing AI output instead of writing it. The paradigm shift is steep and early movers will have 6-12 months of advantage.
Tighten your product discovery loops. When engineering throughput triples, PM and design become the constraint. The organizations that figure out faster iteration on what to build will outpace those focused only on building faster.
Invest in distribution before you need it. Building is no longer the hard part. Finding users, building brand, creating switching costs—those are the new differentiators.

The explosion is coming. Make sure you’re positioned on the right side of it.

I’m writing about agentic coding workflows at hyperdev.matsuoka.com. My open source tools are at github.com/bobmatnyc.

Shumer’s Right About the Tsunami. His Advice Points at the Wrong Shore

Robert Matsuoka — Sun, 15 Feb 2026 20:11:23 GMT

Pointing to the wrong shore?

Matt Shumer’s “Something Big Is Happening” went viral this week. If you are one of the few that haven’t read it, the argument runs like this: AI has crossed a capability threshold. GPT-5.3 Codex and Claude Opus 4.6 can complete complex projects autonomously. The displacement timeline is 1-5 years, not decades. Prepare accordingly.

He’s not wrong about the diagnosis. I’ve been writing about this transformation for nine months now, tracking my own productivity metrics as AI tools evolved from “fancy autocomplete” to something genuinely different. The capability leap is real. So is the timeline.

Where Shumer loses me is the prescription.

His advice: use premium AI tools, build financial reserves, pursue genuine interests, spend an hour daily experimenting. Seems sensible enough. But completely backward about what I believe the transformation actually requires.

The Diagnosis We Agree On

Credit where due: Shumer captures something most commentary misses.

The METR measurements he cites—AI task completion capacity doubling every seven months, now accelerating to four—match what I’ve observed in practice. Claude Code didn’t just get incrementally better between Opus 4.0 and 4.6. It crossed a threshold where orchestration became viable. Not “AI helps me code faster” but “AI completes projects while I supervise.”

My own numbers tell the story: 77 completed code changes across 27 different projects in six weeks. I run multiple AI assistants simultaneously, each working on its own task while I review the results. I haven’t opened my traditional coding software for actual development in months.

Shumer’s right that this changes things. Where he’s wrong is assuming the response is survival preparation.

The Problem with Survival Tips

“Build financial reserves.” “Pursue genuine interests rather than traditional career paths.” “Spend one hour daily experimenting.”

This is advice for people who expect to be displaced. It’s the response you’d give someone watching a wave approach—find high ground, protect what you can, hope you make it through.

But that framing assumes the wave destroys rather than transforms. History suggests otherwise.

I wrote recently about the Jacquard loom lesson. The Canuts were Lyon’s master silk weavers—legendary craftspeople whose identity was wrapped up in thread manipulation. When Jacquard’s programmable loom arrived in 1804, they rioted. Some adapted. Many didn’t.

Here’s what the numbers actually show: total silk workers stayed around 30,000 through the transition. The looms didn’t eliminate jobs—they compressed the master craftsman class while creating lower-wage operator roles. By 1831, 308 silk merchants controlled pricing for 5,575 master weavers managing 20,000+ workers.

The cautionary tale isn’t mass unemployment. It’s wage collapse and status compression for those who kept doing the same job while the job’s value eroded beneath them.

The Canuts who survived weren’t the fastest weavers. They were the ones who recognized that “weaver” was becoming “pattern designer” and “loom operator” and “machine mechanic.” The skill didn’t disappear. It changed shape.

What Shumer’s Advice Misses

“Spend one hour daily experimenting with AI tools.”

This is advice for a Canut. Practice with the new loom. Get comfortable with the interface. Learn the commands.

It completely misses what actually becomes valuable.

The Faros AI Productivity Paradox Report analyzed data across thousands of developers and found something telling: “Adoption skews toward less tenured engineers. Usage is highest among engineers who are newer to the company.”

Why? Because junior engineers face different constraints. Their bottleneck is navigating unfamiliar code, accelerating early contributions, learning system patterns. AI helps enormously with that.

Senior engineers showed lower adoption not because they’re Luddites—because their constraints aren’t code-writing speed. Their bottleneck is “deep system knowledge and organizational context” that AI can’t access. Generating code faster doesn’t help when the constraint is understanding why the system works the way it does.

A University of Chicago Booth working paper found experienced developers were 5-6% more likely to successfully use AI agents for every standard deviation of work experience. Not because they typed better prompts—because they used “plan-first” approaches, laying out objectives and steps before invoking AI.

Expertise improves the ability to delegate. That’s not something you learn from an hour of daily experimentation.

What’s Irreducibly Human

During a recent knowledge base project—120 commits over 9 days, roughly 90% Claude-assisted—I tracked where my time actually went.

The 12 human-only commits weren’t about implementation. They were:

Configuration tweaks requiring domain knowledge (model selection for specific use cases)
Debug logging when something felt wrong
Release management
One research document on architecture options

The human contributions were about judgment. Choosing the right model for email writing versus general queries. Knowing when the AI’s suggestion would create problems downstream. Understanding the client’s actual workflows well enough to structure the system appropriately.

What surprised me: the time savings didn’t come from faster typing. They came from eliminating iteration cycles between “write code” and “realize it doesn’t fit requirements.” Specifying clearly upfront meant fewer rewrites—but that specification work was irreducibly human.

The Qodo 2025 State of AI Coding survey found 65% of developers cite missing context as the primary barrier to shipping AI code without review. That “missing context” is exactly what Shumer’s advice doesn’t address:

Business model specifics: How supplier relationships actually work. Which data matters for a specific service model. Why certain integrations take priority.

Organizational constraints: Budget limitations. Timeline pressures. The technical capabilities of staff who’ll maintain the system.

Historical context: Why previous approaches to similar problems failed. What the client tried and rejected. Political dynamics around adoption.

None of this lives on the public web. It exists in Jira tickets, PowerPoint decks, Slack conversations, and institutional memory. You don’t acquire it through an hour of daily experimentation.

Be a Pattern Master, not a Canut

Pattern Masters, Not Refugees

Shumer frames AI as something happening TO workers. The response he offers is defensive: prepare for impact, build reserves, hope the wave passes.

The Jacquard lesson suggests a different frame. AI is changing WHAT the work is. The response isn’t preparation for displacement—it’s understanding what becomes valuable when implementation gets automated.

Two paths survived the Canut transition:

Pattern designers who translated vision into punch cards the loom could execute. Not thread manipulators—system architects who understood what patterns were possible and how to specify them precisely.

Loom improvers who made the infrastructure more reliable. Not operators—engineers who fixed the tension systems, improved card durability, figured out how to chain looms for industrial-scale production.

The agentic coding transition has the same split.

You can master the patterns—writing specs, orchestrating agents, supervising output. Or you can improve the looms—build the MCP servers, write the orchestration layers, optimize the vector databases that make retrieval-augmented generation work at scale.

Both roles survive. Thread manipulation shrinks in economic value.

Methodology Beats Stockpiling

The people seeing productivity gains from AI tools aren’t spending an hour daily experimenting. They’re developing methodology.

Microsoft Research field experiments across nearly 5,000 developers found 26% productivity gains with AI coding assistants—with less experienced developers showing higher adoption and greater improvements. But those gains came from structured workflows, integrated tooling, and verification processes—not casual usage.

The developers in the METR randomized controlled trial who were 19% slower with AI assistance? They were using AI the Shumer way—open a chat, ask a question, accept the output, repeat. No structured context. No optimized prompts. No verification layer. They felt faster while actually slowing down.

I’ve written about three golden rules that structure my own workflow: let AI write your prompts (research shows 17-50% improvement), make context searchable rather than just present (Lost in the Middle kills accuracy), and build verification into every workflow.

This isn’t what you learn from an hour of daily experimentation. It’s what you develop through deliberate methodology applied to real work with real stakes.

The Question Shumer Should Have Asked

The question isn’t “will I have a job in five years?”

It’s “what does my job become when implementation gets automated?”

For senior engineers, the answer is increasingly clear: subject matter expert plus systems architect. The person who translates ambiguous requirements into precise specifications. The person who identifies when agent outputs miss critical organizational context. The person who maintains system coherence across automated development workflows.

Jue Wang at Bain told MIT Technology Review that developers already spend only 20-40% of their time coding. The rest goes to analyzing problems, customer feedback, product strategy, administrative tasks.

AI doesn’t change what senior engineering is. It reveals what it always was.

The implementation layer was never the irreducible core. It was infrastructure—important, but increasingly invisible. What emerges when that layer automates is something both familiar and different. The same judgment work senior engineers always did, now concentrated and visible.

This looks nothing like Matt. Well. Maybe a bit.

The Year of Software

Here’s what Shumer’s displacement framing completely misses: the long tail.

My friend Matt Rosenberg has zero software development experience as a builder. He’s a marketer who also manages a vacation rental property on Cape Cod. Over the past few weeks he built himself a revenue optimization tool—a proper one, with dynamic pricing recommendations based on local events, seasonal patterns, and competitor analysis.

He didn’t hire a developer. He didn’t buy enterprise software designed for property management chains. He built exactly what he needed for his specific situation.

Here’s what matters about how he got there: Matt spent hours over several weeks doing a deep dive into the tools. Not casual experimentation—serious investment in understanding what AI coding assistants could and couldn’t do. His transformation from marketer to builder wasn’t magic. It was built on two things he already had: deep knowledge of UX from his marketing career, and years of accumulated expertise in the vacation rental space.

That combination—domain expertise plus serious tool investment—is a model for career transformation. Not everyone will replicate Matt’s results. But the path he took isn’t “spend an hour a day experimenting.” It’s “leverage what you already know deeply, and invest real time in learning to express it through new tools.”

Here’s the economics that matter: Matt has a reasonable chance to recoup his effort by sharing this with other Cape Cod hosts—a small audience with the exact same problem. A smaller but real chance someone picks it up for broader distribution. Maybe it stays a side project. Maybe it becomes a micro-business serving vacation rental owners in seasonal markets. A larger concern would never take on a project with such a small TAM.

None of those paths existed before.

In the old model, Matt’s revenue optimizer would never exist. No developer would build it for one vacation rental property. No SaaS company would target Cape Cod vacation rentals as a market segment. The problem was real, Matt’s domain expertise was real, but the economics of software creation didn’t work.

Now they do. And so do the economics of software distribution. The same tools that let Matt build also let him iterate based on feedback from ten other hosts, add features they need, package it for sharing.

This is the year of software. Not because developers are being displaced—because software is finally reaching the long tail of problems that were never economical to solve. The domain expert who understands Cape Cod rental patterns better than any enterprise vendor can encode that knowledge into a working system and find the small audience that needs exactly that.

Shumer sees AI automating existing jobs. He misses AI creating new economic paths for people who were never developers in the first place.

The Canuts didn’t just become pattern masters and loom mechanics. Some became textile entrepreneurs who could suddenly afford custom patterns for small-batch production. The technology didn’t only change who did the work—it changed what work was possible.

This is the new normal if you learn to use the tools.

The Bottom Line

Shumer’s right that something big is happening. The capability threshold is real. The timeline is compressed. The transformation will affect every knowledge worker who touches a computer.

He’s wrong about what to do.

The response isn’t defensive preparation for displacement. It’s understanding what becomes valuable when AI handles implementation. It’s developing methodology for specification and orchestration. It’s acquiring the domain expertise and organizational context that AI can’t access.

And for the Matt Rosenbergs of the world—the domain experts who never learned to code—the response is recognizing that this is their year. The problems they understand better than anyone can finally become software.

Don’t stockpile. Don’t experiment an hour a day. Don’t prepare to be a refugee.

Become a pattern master. Or become a loom improver. Or become the domain expert who finally builds the tool that only you could specify.

The Canuts who survived didn’t out-weave the machines. They recognized that the job had changed shape and positioned themselves for what actually remained valuable.

The transformation is underway. The question isn’t whether you’ll make it through. It’s whether you’ve recognized what the job is becoming—and what new jobs are becoming possible.

I’m Bob Matsuoka, writing about agentic coding and AI-powered development at HyperDev. For more on the pattern master thesis, read my analysis of the Jacquard loom lesson or my deep dive into what remains irreducibly human.

Related reading:

Don’t Be a Canut—Be a Pattern Master - The Jacquard history with data
The Irreducibles: What a Pattern Master Does - Where human value actually sits
HyperDev’s Three Golden Rules - Methodology for professional AI work

Stack Overflow Is Dead

Robert Matsuoka — Thu, 12 Feb 2026 13:30:43 GMT

Stack Overflow vs Reddit, Discord, and Dev.To

TL;DR

Stack Overflow’s question volume collapsed ~95% from peak—back to 2008 levels. Traffic down ~75%. The archive remains valuable; new contributions have largely stopped.
The decline started in 2018—four years before ChatGPT. The model was already broken; AI accelerated an existing trend.
Developers didn’t stop communicating—they migrated to Reddit, Discord, Dev.to, and AI tools. r/programming has 5-6 million members; Discord has 200M+ monthly active users.
The key insight: Stack Overflow optimized for definitive answers—exactly what LLMs do well. Reddit/Discord provide discussion, opinion, validation—what LLMs struggle with.
Transactional Q&A platforms are vulnerable. Community-first platforms are thriving. This is unbundling, not death.

Stack Overflow still gets read. But it stopped getting written.

Question volume has collapsed from 200,000/month at peak to under 10,000 today—a 95% drop. That’s not a community fading, it’s a Q&A product being outcompeted.

Peter Coy wrote a piece in the New York Times recently arguing this signals the end of developer knowledge-sharing. Developers used to share publicly; now they ask ChatGPT privately. “A little sad,” he called it.

I think Coy has it backwards. Developers aren’t talking less—they’re talking elsewhere. The activity migrated to Reddit, Discord, and AI tools. Stack Overflow’s death isn’t about lost community. It’s about an obsolete model being replaced by better ones.

The collapse is real

Let me be clear: Stack Overflow really is dying. By “dying” I mean new contributions—questions, answers, edits—have collapsed. The archive remains; the community doesn’t.

The numbers are stark. Traffic has collapsed roughly 75% from peak, according to third-party analyses like ByteIota (based on SimilarWeb estimates). Question volume tells an even starker story: Stack Exchange Data Explorer queries show monthly questions dropping from ~200,000 at the 2014-2017 peak to under 10,000 by late 2025. That’s back to 2008 levels—the site’s launch year.

I used Stack Overflow heavily for years. Seeing activity fall back to early-2008 levels is hard to overstate.

Fifteen years of growth erased in under three years.

The paradox is that 84% of developers still browse Stack Overflow. The archive has value. But almost nobody contributes anymore. The site has become a museum—visited, but not lived in.

Where developers went

Here’s what the “developers stopped talking” narrative misses: they moved.

Reddit exploded. r/programming has somewhere between 5-6 million members. r/learnprogramming has around 4 million. Both subreddits are trending at over 1,000% on growth metrics, adding thousands of subscribers daily. These aren’t ghost towns—they’re thriving.

Discord expanded far beyond gaming. The service now has over 200 million monthly active users, and developer communities have exploded. Reactiflux (the React community) has 230,000+ members. Python Discord, Rust Discord, and dozens of framework-specific servers have become the default place for real-time developer discussion.

Dev.to grew to millions of members. Built on community-first principles with lower barriers to participation than Stack Overflow ever had.

The MCP ecosystem exploded. MCP—the Model Context Protocol—lets AI assistants call external tools: APIs, databases, services. Think of it as giving Claude or ChatGPT hands instead of just a mouth. In November 2024, there were maybe 100 MCP servers. By February 2026, over 17,000. A new form of executable knowledge-sharing emerged in the time it took Stack Overflow to collapse.

Developers didn’t stop communicating. They stopped using Stack Overflow.

Why Reddit thrives while Stack Overflow dies

Here’s a clarifying question: if LLMs killed Stack Overflow, why didn’t they kill Reddit?

The answer reveals the real dynamic at play.

Stack Overflow optimized for definitive answers. One question, one accepted answer, close the duplicates, move on. The entire system was designed to produce canonical, searchable, authoritative responses to technical questions.

That’s exactly what LLMs do. Better. Faster. Without the closure votes and downvotes.

Reddit optimizes for discussion. There’s no “accepted answer.” The same question can be asked repeatedly without getting closed. People share opinions, debate tradeoffs, validate frustrations, and build community around shared interests.

LLMs struggle with that. Try asking Claude or ChatGPT “Is this framework actually good or does it just have good marketing?” You’ll get a balanced, diplomatic non-answer. Ask Reddit and you’ll get thirty developers telling you exactly what they think, with war stories and receipts.

Factor Stack Overflow Reddit Content model Q&A with “correct” answers Discussion threads Duplicate policy Aggressively closed Tolerated, repeated Reputation system High-stakes rep + privileges Lighter-touch karma Engagement type Transactional (get answer, leave) Conversational (participate, stay) AI competition Direct replacement Complementary

The “site:reddit.com” phenomenon tells the story. Users increasingly append that modifier to Google searches specifically because they want human perspectives, not AI-generated summaries or SEO-optimized content farms. They’re actively seeking out the thing LLMs can’t easily provide.

Stack Overflow competed with AI on AI’s home turf. Reddit doesn’t.

Why developers didn’t fight for it

The product model explains the competitive pressure. But the culture explains why developers didn’t fight to save it.

Stack Overflow’s culture was broken long before ChatGPT arrived. Look at the question volume data: the decline started in 2018—four full years before ChatGPT launched. Monthly questions dropped from 200,000 to 140,000 before GPT-3’s 2020 release, and well before ChatGPT’s late 2022 launch. The trajectory was already set.

ChatGPT didn’t kill Stack Overflow. It was the final nail in the coffin.

In 2019, Stack Overflow surveyed its own community about a much-publicized initiative to improve culture. Seventy-three percent of respondents said the site remained “equally unwelcoming” compared to before the initiative. This wasn’t outside criticism—it was the community itself acknowledging the problem hadn’t been fixed.

Anyone who’s used the site knows what this looked like in practice. You’d ask a question, spend twenty minutes crafting it carefully, and within seconds someone would mark it as a duplicate of a vaguely related question from 2014. Or close it as “not a real question.” Or downvote it without explanation.

The reputation system created perverse incentives. High-rep users had the power to close questions, and the system rewarded fast closure and strict gatekeeping over patient explanation. New users learned quickly that asking questions was a minefield. The site optimized for the archive, not for learning.

Public disputes between moderators and leadership became common. Several moderators resigned, citing disagreements over governance and feeling unsupported by the company.

To be clear: Stack Overflow did things well. The archive is genuinely valuable—24 million questions and answers representing collective knowledge. Discoverability was excellent. The structured Q&A format created stable, linkable URLs. Canonical answers for common problems saved countless hours.

But the community that created those answers? That was poisoned years ago.

LLMs didn’t kill Stack Overflow. They just offered developers an alternative that didn’t make them feel stupid for asking questions.

What happened to everyone else

Stack Overflow isn’t the only platform affected by this shift. Here are some others:

Platform Status Model AI Vulnerability Stack Overflow Collapsing Transactional Q&A Direct replacement Experts Exchange Pivoting Paywalled Q&A High Quora Struggling General Q&A High Reddit Thriving Discussion Low Discord Thriving Real-time community Low Dev.to Thriving Community blogging Low Hacker News Stable Curated discussion Low

Experts Exchange pivoted hard. Remember them? The original “answers behind a paywall” site that Stack Overflow was created to replace? They’re still around, now positioning themselves as “the home of human intelligence.” The anti-AI angle is their entire pitch now.

The pattern: community-first platforms survive. Transactional Q&A platforms are vulnerable. If your model is “user asks question, platform provides answer,” you’re competing directly with AI. If your model is “users discuss, debate, and build relationships,” you’re not.

The new knowledge architecture

What’s replacing Stack Overflow isn’t a single platform. It’s a layered ecosystem.

Layer 1: LLMs for basic questions. “How do I parse JSON in Python?” Don’t post that anywhere—just ask Claude. Faster response, no judgment, no risk of being marked as a duplicate. Eighty-four percent of developers now use AI tools. For basic technical questions, this is often faster and lower-friction than posting ever was.

Layer 2: MCP servers for executable knowledge. This is the part most people haven’t caught up to yet. The Model Context Protocol ecosystem has exploded to 17,000+ servers, with backing from the Linux Foundation and adoption by major players. These aren’t just answers—they’re capabilities. Instead of reading how to do something, you get a tool that does it. Knowledge that executes.

Layer 3: Communities for discussion. Reddit, Discord, Dev.to. When you need opinions, validation, or to talk through a problem with humans who’ve been there, this is where you go. LLMs can tell you how to use a library; humans tell you whether you should.

Layer 4: Deep expertise for analysis. Blogs, Substacks, video courses, conference talks. Long-form content that explores ideas in depth, with personality and opinion. This is where the experienced practitioners share hard-won knowledge that doesn’t fit a Q&A format.

Here’s what my workflow looks like now: basic syntax question → Claude. Need to connect to an API → MCP server. “Is this the right architectural approach?” → Reddit or Discord. Deep dive on tradeoffs → find a practitioner’s blog post.

This architecture is more sophisticated than Stack Overflow ever was. It’s specialized, distributed, and each layer does what it’s good at. The Q&A site tried to be everything; the new ecosystem lets each component excel at its purpose.

The counter-arguments

Fair criticism exists. Let me address it directly.

“Knowledge is fragmented now.” True. Your answer might be on Reddit, Discord, a GitHub issue, a blog post, or an MCP server. That’s friction Stack Overflow didn’t have. But this is the story of the entire internet—from centralized portals to distributed everything. We adapted before; we’ll adapt again.

“We’re losing archival permanence.” Discord conversations disappear. Reddit threads get buried. The 24 million Stack Overflow Q&As were searchable and permanent. This is a real loss. But the community that created those answers was already gone. The archive remains; the contribution stopped years ago.

“Developers are talking differently, not more.” Probably fair. I can’t prove total knowledge exchange increased. What I can show is that multiple platforms are thriving while Stack Overflow collapses. The activity went somewhere.

“Quality control without voting?” Reddit has karma. Discord servers have curation. LLMs let you iterate until you get a useful answer. None of these are perfect, but neither was Stack Overflow’s system—which surfaced answers based on who posted first and had the most reputation.

The bigger picture

Stack Overflow was a toll booth on the highway of developer knowledge. For a decade, if you wanted an answer to a programming question, you went through the booth. You tolerated the closure votes, the duplicate flags, the reputation games, because there wasn’t a better alternative.

LLMs removed the toll booth.

Developers didn’t stop traveling. They stopped paying.

What we’re witnessing isn’t the death of developer communication. It’s the unbundling of a monopoly. Stack Overflow tried to be the single source of truth for all technical questions. That model was always fragile—it just took a sufficient technological shock to reveal it.

The new ecosystem is messier. More distributed. Harder to search—though agentic coding infrastructure is changing that quickly. (I’d argue you could learn nearly as much from my mcp-skillset as from Stack Overflow. Better organized, semantic search, also built from community contributions. Without the BS.) But it’s also more human, more specialized, and better matched to how people actually learn and communicate.

The sentiment shift

One last data point. The 2025 Stack Overflow Developer Survey showed 84% of developers using AI tools—but sentiment was mixed. Only 60% viewed AI positively, down from over 70%. Forty-six percent actively distrusted AI accuracy.

That was 2025. Since then, models like Claude Opus 4.5 have made the generative AI question moot. The accuracy concerns that fed developer skepticism are evaporating. When an AI tool can reliably write, debug, and ship production code, the “will I use this?” question becomes “how do I use this effectively?”

The holdouts are running out of reasons to hold out.

Stack Overflow’s collapse isn’t a tragedy. A platform that optimized for definitive answers got replaced by tools that provide them faster. The discussion, community, and deep expertise went elsewhere—to platforms that were better at providing those things all along.

Good riddance. The future is tools for answers and humans for judgment.

I’m Bob Matsuoka, writing about agentic coding and AI-powered development at HyperDev. For more on how AI is reshaping developer tools, read my analysis of Your IDE Is a Comfort Blanket or The Age of the CLI. Hat tip to Alex Zoghlin for sharing Peter Coy’s article.