Hyperdev: Reviews

HyperDev’s Three Golden Rules

Robert Matsuoka — Tue, 03 Feb 2026 13:32:23 GMT

The myth that won’t die

AI is magic. Just ask it a question, get an answer. Type a prompt, receive working code. The tools are so good now that methodology doesn’t matter—you can wing it and still get results.

This breaks down the moment you try to build anything real.

This manifesto distills what nine months of daily AI-assisted development has taught me. The research is finally catching up—and the findings match what I was seeing six months before the studies came out.

What follows is the opposite of “just ask GPT.”

First, pick the right tool for the job

In my current workflow, ChatGPT is a pal. Claude is my colleague.

I don’t mean that as shade—and this may shift as tools evolve. ChatGPT excels at casual conversation, quick lookups, creative brainstorming, general-purpose chat. When you want to explore an idea without structure, when you’re killing time, when you need something that feels effortless—ChatGPT is designed for that experience.

But professional AI work requires different architecture. Claude.AI offers project-based organization with persistent context, custom instructions per workspace, and document integration that maintains state across sessions. Analysis of 2,500+ repositories found that project-aware approaches like Cursor achieved 39% higher merged PR rates—a University of Chicago team measuring real GitHub outcomes, not survey data. Stateful conversational systems consistently show higher user satisfaction over stateless alternatives in controlled studies.

This isn’t because of model quality—Opus 4.5 and GPT-5.2 trade benchmark scores depending on the task. It’s because of infrastructure. ChatGPT treats each conversation as ephemeral. Claude Projects treat each workspace as a persistent collaboration environment with its own knowledge base, instructions, and accumulated context.

The underlying principle transcends any specific tool: project-scoped state beats conversation-level ephemera for professional work. If Claude Projects didn’t exist, I’d build the same pattern with custom tooling. The tool matters less than the architecture.

Professional work happens in projects. So that’s where the methodology starts.

Rule 1: Don’t write your own prompts

Every project gets its own detailed instructions—written by Claude, not by you.

This sounds counterintuitive. Why would you outsource prompt writing to the same system you’re prompting? Because research consistently shows AI-optimized prompts outperform human-written ones by significant margins.

Google DeepMind’s OPRO study found LLM-generated instructions beat human-designed prompts by up to 50% on Big-Bench Hard tasks. The Suzgun & Kalai Meta-Prompting Framework showed meta-prompting surpassed standard prompting by 17.1%, expert prompting by 17.3%, and multipersona prompting by 15.2%. A 2025 ScienceDirect study on industrial applications found “the structure of the prompt itself had greater influence on determinism and correctness than the choice of LLM.”

The methodology: When starting any Claude Project, I ask Claude to write the project instructions. I describe what I’m building, what matters, what I want emphasized. Claude drafts instructions optimized for how Claude processes context. I review, refine, iterate—but I’m editing Claude’s work, not writing from scratch.
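
A minimal sketch of that first step, assuming you want the request to be repeatable across projects. The function name `build_meta_prompt`, its parameters, and the wording of the request are all hypothetical illustrations, not a template from Anthropic or from any of the studies cited here.

```python
def build_meta_prompt(project: str, priorities: list[str], emphasis: list[str]) -> str:
    """Compose a request asking the model to draft project instructions.

    Every name and phrase here is an illustrative assumption.
    """
    if not project.strip():
        raise ValueError("project description must not be empty")

    def bullets(items: list[str]) -> str:
        # Render a list as plain bullets, with a placeholder when empty.
        return "\n".join(f"- {item}" for item in items) or "- (none given)"

    return "\n".join([
        "Draft custom instructions for a new project workspace.",
        "",
        "What I am building:",
        project.strip(),
        "",
        "What matters most:",
        bullets(priorities),
        "",
        "What I want emphasized:",
        bullets(emphasis),
        "",
        "Write the instructions for yourself, then list any assumptions",
        "you made so I can review and correct them.",
    ])
```

The closing request for a list of assumptions is there to support the review step: the draft is a starting point to edit, not something to accept as-is.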

Here’s why this works: LLMs have specific attention patterns, context prioritization quirks, and interpretation biases. A prompt optimized for how the model actually processes language will outperform one optimized for how humans think prompts should work. The SPRIG Framework found a single AI-optimized system prompt performs on par with task-specific prompts across 47 different task types—and those optimized prompts generalize across model families, parameter sizes, and languages.

One caveat matters: A 2025 MIT Sloan study found automatic prompt rewriting degraded performance by 58% when it overrode user intent. Human oversight of AI-generated instructions remains essential. Don’t just accept whatever Claude produces—review it, test it, refine it. But start from Claude’s draft, not your own.

For knowledge work specifically: Each Claude Project in my workflow has detailed custom instructions even when the work is generic. Writing projects get tone guidance, formatting preferences, citation requirements. Research projects get source evaluation criteria, synthesis approaches, output structure. Coding projects get architecture patterns, testing expectations, documentation standards. The instructions aren’t boilerplate—they’re optimized context that shapes every response.

Rule 2: Make your context searchable, not just present

Cramming documents into a project isn’t context management. It’s hoarding.

The Lost in the Middle phenomenon, documented by researchers at Stanford, UC Berkeley, and Samaya AI, showed accuracy degrades by more than 30% when critical information sits in the middle of long contexts. Chroma’s July 2025 study evaluating 18 LLMs confirmed that performance degrades as input length increases, even on simple tasks. More surprisingly, shuffled incoherent context sometimes outperformed logically structured input.

More context isn’t automatically better. Strategic context is better.
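
One heuristic suggested by the lost-in-the-middle finding is to keep the most relevant material near the start and end of a long context. The sketch below shows that ordering. It assumes you already have documents ranked by relevance, the function name is hypothetical, and the studies above do not prescribe this arrangement, so treat it as something to test rather than a rule.

```python
def order_for_long_context(docs_by_relevance: list[str]) -> list[str]:
    """Place the most relevant documents at the edges of the context.

    Input must be sorted most-relevant first. Items are dealt alternately
    to the front and the back, so the least relevant land in the middle.
    """
    front: list[str] = []
    back: list[str] = []
    for i, doc in enumerate(docs_by_relevance):
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    # Reverse the back half so the second-best document ends up last.
    return front + back[::-1]


print(order_for_long_context(["a", "b", "c", "d", "e"]))
# prints ['a', 'c', 'e', 'd', 'b']
```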

The foundation is working in defined projects. Every significant task gets its own Claude Project—not a continuation of some sprawling conversation that’s accumulated context about twelve different topics. A project for client work. A project for research. A project for each codebase. The discipline of organizing work into discrete, focused workspaces is where methodology starts.

Claude.AI automatically RAG-indexes documents you add to projects. This matters. Your documents aren’t just sitting in the context window hoping Claude notices the relevant paragraph. They’re indexed and retrieved semantically—Claude searches them based on what’s relevant to your current query. This turns document-heavy projects from context-window-stuffing into actual knowledge bases.

But RAG indexing only works when you give it something worth indexing. I’m deliberate about what goes into each project:

Project-specific documentation (not everything that might be tangentially relevant)
Reference material I’ll need repeatedly
Examples of the output quality I want
Previous work that should inform new work

My implementation adds a second layer for code: I use mcp-vector-search, an MCP server that AST-chunks all code in a project and makes it semantically searchable. This matters because of how code context works.

The cAST paper from Carnegie Mellon found AST-aware chunking improved code generation Pass@1 by 2.67-5.5 points across benchmarks, with Recall@5 improvements of +4.3 points. The principle: chunk boundaries aligned with complete syntactic units—functions, classes, modules—preserve semantic integrity that naive text splitting destroys.
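
To make the boundary principle concrete, here is a minimal sketch using Python’s standard `ast` module. It is not how cAST or mcp-vector-search is implemented: it only handles Python, keeps top-level functions and classes, and ignores size budgets and module-level statements. The function name and the chunk fields are assumptions made for illustration.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Split a module into chunks aligned with top-level defs and classes."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit above the def/class line, so start from the earliest one.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            end = node.end_lineno
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": start,
                "end_line": end,
                "text": "\n".join(lines[start - 1:end]),
            })
    return chunks
```

Each chunk is a complete syntactic unit, so a retrieved function never arrives cut off halfway through its body the way it can with fixed-size text splitting.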

When Claude needs context about a specific function, it searches the vector database and retrieves the relevant code chunks with their surrounding context. It’s not wading through 50,000 lines hoping the important bit isn’t lost in the middle. It’s querying a structured knowledge base and getting precisely what’s relevant.
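
The retrieval step can be sketched in a few lines. This toy version substitutes word counts for a real embedding model and a dictionary for a vector database, so it only illustrates the shape of the lookup. The function names and the sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query: str, chunks: dict[str, str], top_k: int = 2) -> list[str]:
    """Return the names of the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda name: cosine(q, embed(chunks[name])), reverse=True)
    return ranked[:top_k]


chunks = {
    "parse_date": "def parse_date(text): return datetime.strptime(text, fmt)",
    "send_email": "def send_email(to, body): smtp.send(to, body)",
    "load_config": "def load_config(path): return json.load(open(path))",
}
print(search("parse a date string from text", chunks, top_k=1))
# prints ['parse_date']
```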

For knowledge work without code: The same principle applies. Don’t dump every related document into a project and hope for the best. Curate. Organize. Consider which information actually matters for which tasks. If you’re working with extensive reference material, an external RAG system beats relying on the context window alone (research shows knowledge graph-based RAG reduces hallucinations by 20-30%).

The question isn’t “how much context can I add?” It’s “what context actually improves output quality for this specific task?”

Rule 3: Build verification into your workflow

AI checking AI works—when structured correctly.

Hallucination rates vary wildly depending on how you measure. Vectara’s leaderboard, which measures hallucinations in document summarization, shows 0.7% for the best models, climbing to 29.9% for the weakest. For code specifically, one study of AI-generated code found security vulnerabilities in 48% of samples, though this varies by language, task complexity, and vulnerability definition. Apiiro research found AI-generated code introduced 322% more privilege escalation paths and 153% more design flaws in their analysis.

The pattern is consistent: unverified AI output ships bugs. Maybe not every time. But often enough that verification isn’t optional.

My verification stack has two components:

First, unit test coverage. When Claude Code generates or modifies code, I require tests for any significant functionality. Not because I don’t trust the code, but because tests surface the gaps between what I asked for and what was implemented. Tests catch the “almost right, but not quite” problem that 66% of developers cite as their top frustration.

Second—and this is the part that surprises people—I use GPT to proofread Claude’s writing output. Different model, different training data, different biases. When GPT catches something Claude missed, that’s signal. When both agree, confidence increases.

State-of-the-art LLMs can agree with human judgment up to 85% of the time on evaluation tasks when used correctly. Hallucination detection tools achieve 90-91% accuracy. AI code review tools like CodeRabbit achieve 46% bug detection accuracy versus under 20% for traditional static analyzers, with teams reporting a 40% reduction in code review time.

But LLM-as-judge has documented limitations. Hallucination “echo chambers” form when generating and evaluating models share biases. Best practices from the research: use binary yes/no questions (they outperform complex scales), prompt LLMs to explain their ratings, and use multiple evaluators with voting or averaging.
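
A minimal sketch of the binary-question-plus-voting pattern. Each judge here is a plain callable returning yes or no. In practice each one would wrap a call to a different model, but that wiring depends on the provider and is left out. The `Judge` type and `majority_verdict` are hypothetical names.

```python
from typing import Callable

# A judge answers one binary question about a piece of text.
Judge = Callable[[str, str], bool]


def majority_verdict(text: str, question: str, judges: list[Judge]) -> tuple[bool, float]:
    """Ask each judge the same yes/no question; return (verdict, agreement)."""
    if not judges:
        raise ValueError("at least one judge is required")
    votes = [bool(judge(question, text)) for judge in judges]
    yes = sum(votes)
    verdict = yes * 2 > len(votes)  # a tie counts as "no"
    agreement = max(yes, len(votes) - yes) / len(votes)
    return verdict, agreement


# Stub judges standing in for three different models.
stubs: list[Judge] = [
    lambda q, t: True,
    lambda q, t: True,
    lambda q, t: False,
]
print(majority_verdict("draft text", "Is every claim supported by the sources?", stubs))
```

A low agreement score is a useful signal on its own: it marks the outputs that most need a human look.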

The enterprise standard: 76% of organizations now include human-in-the-loop workflows to catch hallucinations before deployment. Knowledge workers spend an average of 4.3 hours per week fact-checking AI outputs. That time isn’t overhead—it’s the actual cost of making AI work reliable.

For knowledge work: Every significant document I produce with AI assistance gets reviewed by a different model before it ships. Sometimes that’s GPT reviewing Claude. Sometimes it’s Claude reviewing its own earlier work with fresh context. The specific models matter less than the principle: verification is not optional.

The gaps that remain

Two problems Claude.AI doesn’t solve yet.

Cross-project knowledge. Every project is an island. Work you did in one project doesn’t inform another. If you’ve solved a similar problem before, Claude doesn’t know unless you manually copy context over.

You can build this yourself with git—treat your Claude Projects like a monorepo where related work lives in the same repository structure, manually maintaining connections. I expect Anthropic will ship native cross-project search eventually. Until then, it’s manual work or custom tooling.

Temporal decay. Documents you added six months ago sit alongside documents you added yesterday, weighted equally. Static context gets stale. Your project instructions reference approaches you’ve since abandoned. Old architecture docs describe systems that no longer exist.

Claude.AI doesn’t rank by freshness. It doesn’t know that your January notes are probably less relevant than your December work. Kuzu-memory builds freshness weighting into its knowledge graph architecture—newer information surfaces more readily. But that’s external tooling solving a problem the platform should handle natively.
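
Freshness weighting can be as simple as decaying a relevance score by document age. The sketch below uses exponential decay with a half-life. It is not Kuzu-memory’s implementation, and the function name and the 90-day default are arbitrary assumptions for illustration.

```python
from datetime import date


def freshness_weighted_score(relevance: float, added_on: date, today: date,
                             half_life_days: float = 90.0) -> float:
    """Down-weight a relevance score by document age using exponential decay."""
    # Documents dated in the future are treated as brand new.
    age_days = max((today - added_on).days, 0)
    return relevance * 0.5 ** (age_days / half_life_days)


# A document exactly one half-life old keeps half its relevance.
print(freshness_weighted_score(0.8, date(2026, 1, 1), date(2026, 4, 1)))
```

With this weighting, an old document can still win if it is much more relevant, but it has to earn its place against newer material.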

Both gaps will close. The question is whether you wait for Anthropic or build workarounds now.

The implementation pattern

Here’s what this looks like in practice:

Starting a new project:

Create a Claude Project—one project per distinct area of work
Add curated documents that will be RAG-indexed automatically
Ask Claude to draft project instructions based on what I’m building
Review and refine those instructions, testing with sample tasks
For code projects, initialize mcp-vector-search to create searchable code context

During work:

Work within the project, maintaining accumulated context
For code: rely on vector search for targeted retrieval rather than full-codebase prompts
For knowledge work: let Claude’s RAG indexing surface relevant document sections
Track what context actually improves outputs and prune what doesn’t

Before shipping:

For code: run tests, check coverage, verify the implementation matches intent
For writing: have a different model review for errors, inconsistencies, hallucinations
For anything significant: human review of the AI-verified output
Document what worked for future project instructions

What this isn’t

This manifesto isn’t about making AI slower or more bureaucratic. It’s about making AI reliable enough to trust with professional work.

If you’re exploring an idea, prototyping something throwaway, or just need a quick answer—open a chat, ask a question, move on. Nothing wrong with casual AI use for casual tasks.

But the moment stakes rise—production code, client deliverables, work you’ll be accountable for—methodology matters. The developers in the METR study who were 19% slower? They were using AI the casual way—open a chat, ask a question, accept the output, repeat. No structured context. No optimized prompts. No verification layer. They felt faster while actually slowing down.

The teams seeing 27-39% productivity gains for junior developers in the MIT/Harvard/Microsoft field experiment? They had structured workflows, integrated tooling, and verification processes.

The difference is methodology.

The bottom line

Professional AI work in 2026 requires three things:

Structured instructions: Don’t write your own prompts from scratch. Have AI draft optimized instructions, then review and refine. The research shows 17-50% improvement from structured prompting approaches—gains you’re leaving on the table with casual use.

Strategic context: More isn’t better. Searchable, organized, task-relevant context beats document hoarding. Work in defined projects. Let Claude’s RAG indexing do its job. Add AST-aware code chunking for development work. Organize documents deliberately, keeping the U-shaped attention curve in mind.

Systematic verification: AI checking AI works when structured correctly—different models, binary evaluation questions, multiple passes. But human-in-the-loop remains the enterprise standard because 46% of knowledge workers report making mistakes based on AI hallucinations.

The opposite of “just ask GPT” isn’t slower, more complicated AI use. It’s AI use that actually delivers the productivity gains the marketing promises but casual usage fails to achieve.

I’m Bob Matsuoka, writing about agentic coding and AI-powered development at HyperDev. For more on multi-agent orchestration systems, read my analysis of what a Pattern Master does or my deep dive into the benefits of multi-agent orchestration.

When “Claude Code for Productivity” Meets Reality

Robert Matsuoka — Thu, 15 Jan 2026 20:07:13 GMT

I’ve been running Claude.AI Desktop as a lightweight “AI OS” on my Mac for months. The connector ecosystem turned it into my command center—Linear integration for project management, Google Drive for document access, the whole MCP stack piping context wherever I need it. Works well. Mostly.

The crashes are what finally broke my patience. Long-running research tasks, complex context synthesis, anything pushing past an hour—Claude.AI Desktop would inevitably barf and lose everything. No graceful degradation. Just gone.

So when Anthropic announced Cowork as “Claude Code for the rest of your work”? That caught my attention. This came right on the heels of /chrome going to beta for all paid users—the browser agent that lets Claude navigate websites, fill forms, and execute multi-step workflows. Two releases in quick succession that look like the first real push to close the gap between developer-focused Claude Code and everyone else.

Here’s where things get interesting: I’m on vacation this week, but I’m about to step into a new CTO role (announcement coming soon). I wanted to do proper research on my new company: reviewing their GitHub repos, digging through Confluence documentation, mapping out their Jira workflow. These are tasks I would have tackled with Claude Code before, but I figured Cowork might handle them just as well.

The verdict? Cowork is a solid step forward with one critical gap that’ll frustrate power users.

What Works

Cowork handles long-running tasks dramatically better than Claude.AI Desktop. The architecture borrowed from Claude Code makes an immediate difference.

Simon Willison’s analysis revealed that Cowork spins up a full Linux virtual machine using VZVirtualMachine, part of Apple’s Virtualization framework. Your designated folder gets mounted into a containerized environment. This explains the stability: Claude isn’t wrestling with macOS filesystem quirks directly, and crashes in the VM don’t take down your whole session.

Like Claude Code, Cowork sets up to-do lists for complex tasks, breaks work into manageable chunks, and reports progress as it goes. I kicked off a documentation analysis task and walked away for two hours. Came back to find structured notes organized by topic, saved right where I asked for them on my local filesystem.

VentureBeat’s coverage mentioned that Anthropic built Cowork in roughly a week and a half—Boris Cherny confirmed this timing on X. That recursive approach (building Cowork with Claude Code) shows in how the tool handles work. It thinks in files and folders, not conversation turns. Output lands where you can use it.

For straightforward productivity tasks—organizing a mess of screenshots into a spreadsheet, sorting a chaotic downloads folder, drafting reports from scattered notes—Cowork does the job without hand-holding.

Where It Falls Short

But here’s the thing: Cowork is no Claude Code.

My research project needed context that spanned multiple systems: GitHub code structure, Confluence documentation patterns, Jira workflow history. Each piece informed the others. I expected to orchestrate this from Claude.AI: have it delegate tasks to Cowork, maybe kick off /chrome sessions to pull information from web interfaces, then synthesize everything back into my working context.

That coordination doesn’t exist yet.

What I found instead: each tool operates in isolation. The Register’s coverage captures the limitation—Cowork sessions don’t persist memory across logins, and there’s no cross-device sync. More critically for my use case, there’s no native handoff between Claude.AI conversation context and Cowork’s working memory.

I ended up building my own bridge. Created a folder that both Cowork and I could access, used artifacts as handoff files, manually copied context back and forth between sessions. Functional but clunky. Still faster than the alternative—I wouldn’t have attempted this research at all without Cowork (or Claude Code). The coordination overhead was real, but so was the output.
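
A shared-folder bridge like this can be kept tidy with a small, consistent handoff format. The sketch below writes and reads JSON notes in one folder. The file naming, the note fields, and both function names are hypothetical, and nothing here uses a Cowork or Claude.AI API.

```python
import json
import re
from datetime import datetime, timezone
from pathlib import Path


def write_handoff(folder: Path, task: str, summary: str,
                  open_questions: list[str]) -> Path:
    """Save one handoff note as JSON in the shared folder and return its path."""
    folder.mkdir(parents=True, exist_ok=True)
    # Turn the task title into a filesystem-friendly slug.
    slug = re.sub(r"[^a-z0-9]+", "-", task.lower()).strip("-") or "untitled"
    note = {
        "task": task,
        "summary": summary,
        "open_questions": open_questions,
        "written_at": datetime.now(timezone.utc).isoformat(),
    }
    path = folder / f"handoff-{slug}.json"
    path.write_text(json.dumps(note, indent=2), encoding="utf-8")
    return path


def read_handoffs(folder: Path) -> list[dict]:
    """Load every handoff note in the folder, sorted by file name."""
    return [
        json.loads(p.read_text(encoding="utf-8"))
        for p in sorted(folder.glob("handoff-*.json"))
    ]
```

At the start of a fresh session, the notes from `read_handoffs` give you one place to pull prior context from instead of reconstructing it from memory.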

One TechPlanet user’s early report echoed this: “while Cowork can handle complex tasks, it sometimes lacks the nuanced understanding needed for context-specific decisions.”

Multi-session work with accumulating context? Still painful. Each Cowork session starts fresh, and you’re responsible for priming it with whatever prior work matters.

The Integration Gap

What I expected: Claude.AI as the orchestration layer. Feed it a complex research goal. Have it break the work down, delegate filesystem tasks to Cowork, dispatch browser automation to /chrome, then pull everything back together with its memory of our ongoing conversation.

What I got: Three capable tools that don’t talk to each other without significant manual intervention.

Anthropic will probably close this gap—the Connectors architecture for linking to third-party services already exists in Claude Desktop. Axios notes that Cowork can combine with Claude’s other features like the Gmail connector. But the coordination layer that would make these tools feel like one coherent system? Not there yet.

For developers comfortable with Claude Code, you already have workarounds. Shell scripts. MCP servers. Custom tooling. Cowork isn’t adding much if you’re already operating at that level.

For the productivity users Anthropic is targeting with the “not just developers” messaging? The isolation between tools will feel arbitrary. Why can’t I just tell Claude to “research this company using whatever tools you need and summarize what you find”?

The Upside

After a week of testing, I’ve settled into a workflow: Claude.AI handles memory and complex context management. It holds the ongoing conversation, maintains my research threads, keeps track of what we’ve already figured out.

Cowork is a capable executor for tasks that benefit from filesystem access and longer attention spans. The stability alone makes it worth using for anything that would have crashed Claude.AI Desktop before.

The gap is coordination. Right now, I’m the integration layer—shuffling context between tools, managing the handoffs, keeping track of what each system knows.

That’s manageable for someone who’s used to working this way. Less acceptable for the mainstream productivity users this launch targets.

The Bigger Picture

The pattern unfolding in software engineering is hitting productivity work now. Two sectors should be watching.

First: SaaS no-code solutions. Zapier, Make, all those 70% solutions that almost do what you need but never quite get there. I’ll use Cowork over any of them for complex automation, even with current limitations. If Anthropic ships cloud/desktop sync—and they’re clearly building toward it—these platforms will need to move significantly further up the stack or watch their core use cases evaporate. The friction of “almost works” loses to “actually works” every time.

Second: certain productivity roles. The pattern I wrote about Monday with pattern masters applies here. Roles heavy on information gathering, document synthesis, and routine coordination are compressing fast. The roles expanding: strategic thinking (someone still has to decide what to build), pattern recognition and synthesis (connecting dots across domains), human service work (empathy doesn’t automate), and integrators who can scale operations across systems.

The transition won’t be uniform—smaller orgs will feel it first, regulated industries slower—but the directional pressure is obvious.

The Reality of Research Preview

Fortune’s analysis positions Cowork as a competitive threat to AI productivity and automation startups. Maybe eventually. Right now, it’s what Boris Cherny said it was: “early and raw, similar to what Claude Code felt like when it first launched.”

The pricing reinforces this. $100 to $200 per month for Claude Max, which you need to access Cowork at all. That’s enterprise money for a research preview that doesn’t yet integrate with the rest of Claude’s ecosystem.

I’ll finish my company research when I get home, probably back in Claude Code where multi-session context management works better. But I’ll keep testing Cowork for the standalone tasks where its stability advantage matters.

What I’m really waiting for: seamless coordination between Claude.AI’s conversation context, Cowork’s filesystem capabilities, and /chrome’s browser automation. When those three tools talk to each other natively, we’ll have something worth getting excited about.

Until then? Cowork is a solid research preview that shows where Anthropic is heading. Just don’t expect it to replace your existing workflows quite yet.

I’m Bob Matsuoka, writing about agentic coding and AI-powered development at HyperDev. For more on AI productivity tools, read my analysis of Claude Code’s evolution to agent orchestration or my deep dive into the future of software engineering.

I Was Wrong About AntiGravity

Robert Matsuoka — Mon, 01 Dec 2025 15:03:13 GMT

Yesterday morning (Thanksgiving Day) I published a snarky article trashing AntiGravity and questioning Google’s scattered, unfocused strategic direction. I owe a bit of a mea culpa.

Not because I was wrong about the coding capabilities. Not because Google suddenly has a coherent strategy. I stand by all of that.

I was wrong because I completely missed the real opportunity.

AntiGravity may end up being the best end-user browser testing tool we have.

The Penny Dropped Mid-Afternoon

Here’s what happened. I’d been grinding through my usual workflow—Claude Code for the heavy lifting, Augment Code for quick iterations—when I needed to verify some UI changes across a client project. The tedious kind of work. Click through forms, check responsive behavior, verify that the dropdown actually drops down.

Normally this means either scripting something in Playwright (time I don’t have), using Safari with AppleScript (painful), or just... clicking through things manually like it’s 2015.

On a whim, I pointed AntiGravity at my codebase and asked it to review my recent commits, understand what I’d changed, and test the affected UI components.

And then I watched something impressive happen.

A browser window appeared with a cool blue aura—Google’s visual indicator that the agent has taken control. It started navigating to pages I’d modified. Filling out forms. Scrolling through content. Moving between sections. All while the agent in the IDE was tracking everything: console logs, network requests, visual state changes.

Slow? Yes. Deliberate, even. But completely autonomous and remarkably thorough.

What Makes This Different

I know what you’re thinking. Playwright exists. Puppeteer exists. I’ve written my own MCP browser plugin that I was planning to finish “someday.” There are other browser MCP tools out there. Hell, you can cobble together Safari automation with AppleScript if you hate yourself enough.

But all of that misses the point.

The integration here runs deep. This isn’t just browser automation bolted onto a coding tool. It’s a single agent that can:

Read your entire codebase
Understand your recent commits and what changed
Either read an existing test plan or build one based on what you’ve touched
Execute that test plan in an actual browser
Report results back with console logs, screenshots, and video recordings

That last part matters. When something breaks, you get the actual console output. Not “test failed on line 47.” The real error messages from your real application.

Ihor Sasovets, a former SET engineer, spotted this immediately: “As an ex-SET engineer, I know how much time you spend on selecting the right selector, adding necessary waits and so on. I believe that Antigravity can help you to automate a lot of the processes.”

Simon Willison noted that it “plays a similar role to Playwright MCP, allowing the agent to directly test the web applications it is building.”

So I’m not the only one who noticed. But I might be the first to call it out as the primary use case rather than a side feature.

A Workflow That Works

Here’s the scenario that every engineer knows and every QA team hates:

You’ve done a few days of coding. Features are “done.” Time to hand it off to QA. But you haven’t really tested it yourself because—let’s be honest—browser testing is tedious as hell and you’ve got other things to do.

Now instead of building a test plan manually or throwing it over the wall with a vague “I changed some stuff in the checkout flow,” you can:

Point AntiGravity at your branch
Ask it to review your commits from the past few days
Have it build a user-facing test plan based on what you actually changed
Say “now run it”

The agent fires up a browser, takes over with that blue aura, and starts methodically working through test scenarios. Form submissions. Navigation flows. Error states. Responsive behavior.

When it’s done, you’ve got documentation of what was tested, screenshots of key states, and a clear record of any failures—complete with the technical context from your codebase that a human QA tester would have to ask you about.
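
The agent derives its test plan by reasoning over the diff, which a few lines of code cannot reproduce. Still, the core idea of mapping what changed to what a user would exercise can be shown with a deliberately simple sketch. The path prefixes, flow names, and function name below are all hypothetical, and this is not how AntiGravity works internally.

```python
def plan_from_changes(changed_files: list[str],
                      flows_by_prefix: dict[str, list[str]]) -> list[str]:
    """Return the UI flows touched by the changed files, in first-seen order."""
    plan: list[str] = []
    for path in changed_files:
        for prefix, flows in flows_by_prefix.items():
            if path.startswith(prefix):
                for flow in flows:
                    # Skip flows already queued by an earlier file.
                    if flow not in plan:
                        plan.append(flow)
    return plan


flows = {
    "src/checkout/": ["Submit checkout form", "Checkout error states"],
    "src/nav/": ["Main navigation"],
}
changed = ["src/checkout/form.tsx", "src/checkout/api.ts", "README.md"]
print(plan_from_changes(changed, flows))
# prints ['Submit checkout form', 'Checkout error states']
```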

Google Didn’t Intend This

Let me be clear: this is not what Google built AntiGravity for.

The official announcement positions it as an “agentic development platform” where agents “autonomously plan and execute complex, end-to-end software tasks.” Browser control is framed as verification—the agent writes code, then checks that it works.

Nowhere does Google say “hey, use this as your QA automation tool.”

But that’s often how the best use cases emerge. Slack grew out of an internal tool at a game studio. Instagram started as a check-in app. Sometimes you build something and users find the actual value.

I’m predicting that browser testing will become AntiGravity’s killer app. Not because Google intended it, but because nothing else combines codebase understanding with browser control in quite this way.

The Token Reality Check

Now the bad news.

One small feature testing session—maybe 20 minutes of browser interaction—ate through my entire free token allocation.

This won’t be cheap to run regularly.

The free tier is basically enough to see if the tool works for you. Any serious usage is going to require paid tokens, and based on my initial experience, the cost-per-test-session is going to be meaningful. We’re not talking Playwright-level costs here. We’re talking LLM inference costs for an agent that’s reasoning about your codebase and orchestrating browser actions simultaneously.

For individual developers doing occasional testing? Probably fine. For teams wanting to run comprehensive test suites? Budget accordingly.

Where This Fits In My Stack

I’m not dropping Claude Code. I’m not abandoning Augment Code. AntiGravity’s actual coding capabilities remain... let’s call them “developing.”

But I can see this becoming my third most-used tool, specifically for one narrow but persistent pain point: technical code review paired with user experience validation.

The integration between “I can read all your code” and “I can also control a browser” is something I haven’t seen anywhere else. That combination solves a specific problem I’ve been hacking around for months.

My half-finished MCP browser extension? Deleted. The janky AppleScript automation I cobbled together? Gone.

The Caveats You Should Know

A few things to keep in mind before you get too excited:

It’s slow. Not “annoyingly slow” but definitely “watch it happen rather than fire and forget” slow. Each action requires reasoning, and reasoning takes time.

Stability is inconsistent. DevClass reported “model provider overload” errors and repeated agent terminations during their testing. I hit similar issues—not constantly, but often enough to notice.

Security concerns are real. Johann Rehberger documented multiple vulnerabilities including remote command execution and data exfiltration risks. The tool has broad system access by design. That cuts both ways.

Chrome only. No Firefox. No Safari. No WebKit testing. If cross-browser validation matters to your workflow, you’ll still need traditional tools.

Bottom Line

I was wrong about AntiGravity. Not about the coding quality (still mediocre) or Google’s strategy (still scattered). I was wrong about where the value lives.

There’s nothing quite like this for integrating technical understanding of a codebase with the ability to carefully control a browser. The agent doesn’t just click buttons—it knows why it’s clicking buttons because it read the code that made them.

Google probably didn’t plan for this to be the headline feature. But I predict it will be.

Just budget for the tokens.

I’m Bob Matsuoka, writing about agentic coding and AI-powered development at HyperDev. For more on AI coding tool evaluation, read my analysis of The AI Coding Tools Market Correction or my deep dive into Multi-Agent Orchestration in Practice.

OpenAI’s Codex-Max Solves Q1 2025’s Problem in Q4 2025

Robert Matsuoka — Thu, 27 Nov 2025 15:00:37 GMT

I spent the day putting GPT-5.1-Codex-Max through real-world testing on actual production features. Not demos, not toy projects - features for my AI Power Rankings site, tracked through proper tickets, evaluated by my multi-agent orchestration system.

The results? Claude MPM gave it a B-. Not my assessment - an AI orchestration framework evaluating another AI tool based on delivered work quality.

That grade tells you everything about where Codex-Max stands: technically competent, meaningfully improved over the original Codex, and fundamentally too late to matter competitively.

TL;DR

Tested Codex-Max on two production features for AI Power Rankings, tracked via mcp-ticketer with full receipts
Claude MPM evaluation: B- grade on delivered work quality
Significant improvements over original Codex - better UX, stronger coding capabilities
Critical gap: No local operations, no deployment automation, no QA integration
Augment Code’s single agent handles full lifecycle including deployment/QA that Codex-Max can’t touch
Bottom line: Solving Q1 2025’s problem in Q4 2025 when the market has moved to complete workflow orchestration
Better than vibe coding, but that’s no longer the competitive bar

The testing setup: real features, real tracking

I don’t do toy demos. When I evaluate tools, I use them on actual production work and document everything. For Codex-Max, I assigned two features on AI Power Rankings:

Feature enhancement for tool comparison views
Data integration improvements for ranking algorithms

The methodology was straightforward: Codex-Max handled the full implementation, creating and managing its own tickets through mcp-ticketer - its MCP integration works fine. When it was done (or done-ish), I brought in Claude MPM to assess the results and fix issues, also using proper ticket tracking and updates. This isn’t me giving vibes-based opinions - it’s one AI system reviewing another’s delivered work based on measurable outcomes.

The receipts are public: GitHub Issue #52 shows exactly what was attempted, what succeeded, what failed, and how long everything took. You can see the actual code Codex-Max produced in the implementation branch.

What Claude MPM found

The B- grade breaks down into specific strengths and weaknesses:

What Codex-Max got right:

Clean, readable code generation
Reasonable architectural decisions for straightforward features
Significantly better UX than the original Codex
Multi-file editing that actually worked most of the time
GitHub integration that properly understood PR context

Where it fell short:

No ability to handle local deployment workflows
Zero QA automation capabilities
Cloud-only execution created friction for iterative testing
Context preservation degraded on longer tasks despite compaction claims
Required more hand-holding than expected for edge cases

The specific example that crystallized the problem: When the feature needed deployment validation and QA checks, Codex-Max just... stopped. It generated the code, opened the PR, and that was it. My agents handle local ops deployment and QA automatically - Codex couldn’t touch any of that workflow.

The timing problem: Q1 in Q4

Here’s what makes this interesting: If I’d had Codex-Max five months ago, in Q2 2025, I would have been thrilled. The improvements over the original Codex are substantial. The UX polish is real. The coding quality bump matters.

But we’re not in Q2 2025. We’re in Q4 2025, and the market evolved faster than OpenAI shipped.

What would have been competitive in January 2025:

Better code generation than GitHub Copilot
Multi-file editing with reasonable accuracy
Cloud execution for async workflows
GitHub integration for PR reviews

What the market actually looks like in November 2025:

Complete workflow orchestration from specification to deployment
Multi-agent systems handling specialized tasks in parallel
Local operations integration as table stakes
Full QA automation and testing frameworks
Context preservation across days-long development sessions

Codex-Max improved a single-agent cloud execution model. The market moved to orchestrated multi-agent systems with full lifecycle coverage.

In AI development tools, nine months isn’t a product cycle. It’s a generation.

What modern tools handle

My daily driver is Claude MPM on top of Claude Code - that’s what handles my orchestration needs. But here’s what makes Codex-Max’s limitations so stark: Even Augment Code’s single agent handles everything Codex-Max can’t.

To be fair to Codex-Max: I didn’t spend as long building scaffolding for it as I would for Claude Code. I gave it an AGENTS.md file (converted from my CLAUDE.md), but I didn’t invest the same level of context setup. That said, the comparison to Augment Code remains instructive - Augment doesn’t need extensive scaffolding because it builds a thorough semantic index of your entire codebase automatically. That’s where modern tools actually are.

Deployment automation: Augment Code manages local deployment, runs validation checks, and handles rollback if needed. Codex-Max stops at code generation.

QA integration: Augment Code triggers test suites, analyzes failures, and iterates on fixes. Codex-Max requires manual testing.

Context acquisition: Augment Code builds semantic indexes automatically, understanding codebases without manual scaffolding. Codex-Max relies on explicit configuration and context seeding.

Local operations: Augment Code executes locally with full filesystem and process access. Codex-Max lives in cloud sandboxes with limited local integration.

Modern tools handle the complete development lifecycle with minimal setup overhead. Augment Code does it with a single agent and automatic context acquisition. I do it with Claude MPM’s multi-agent orchestration and explicit context management. Both approaches work - they just represent where the market actually is.

The fact that a single modern agent handles what Codex-Max can’t shows this isn’t a fundamental AI limitation. It’s an architectural choice OpenAI made that the market already moved past.

What Codex-Max delivers

Credit where it’s due: OpenAI shipped meaningful improvements over the original Codex.

The UX polish is real. The terminal interface works smoothly. The GitHub integration feels native rather than bolted-on. The multi-file editing produces sensible diffs. The cloud execution model eliminates local environment concerns for teams that want that abstraction.

The coding quality improved. Comparing Codex-Max output to original Codex shows clear advancement. The code is cleaner, the architectural decisions are more sound, and the edge case handling is better. That 77.9% SWE-bench Verified score isn’t marketing fiction.

The compaction technology works. Tasks do run longer than the original Codex’s hard limits. The 24+ hour sessions OpenAI advertises are real, though in practice I found context degradation kicked in well before that for complex features.

For teams that don’t have orchestration frameworks, don’t need deployment automation, and primarily want better code generation with nice GitHub integration, Codex-Max represents a solid upgrade. It’s a good tool.

It’s just a good tool for a problem space that evolved past it.

Better than vibe coding—but that’s not the bar anymore

Codex-Max beats casual “vibe coding” - developers prompting ChatGPT or Claude for code snippets and copying them over.

That’s not the competitive bar anymore.

The bar in Q4 2025 is:

Complete workflow automation from specification to deployment
Multi-agent orchestration handling specialized tasks in parallel
Full lifecycle coverage including ops, testing, and monitoring
Context preservation across multi-day development sessions
Local and cloud integration based on actual workflow needs

Codex-Max cleared the “better than copy-paste from ChatGPT” bar. It didn’t clear the “competitive with actual orchestration frameworks” bar.

For hobbyists and individual developers doing straightforward features, Codex-Max might be an improvement over their current workflow. For teams doing serious development work with orchestration systems, it’s a step backward from where they already are.

The architectural problem: single-agent vs. orchestration

Codex-Max improved a single-agent model. The market moved to orchestrated multi-agent systems.

Single-agent limitations:

One context window (even with compaction)
One reasoning path
One set of tool integrations
Sequential execution only

Multi-agent advantages:

Specialized agents for different tasks (research, coding, testing, deployment)
Parallel execution across workstreams
Persistent shared memory across agents
Coordinated hand-offs at natural boundaries

I can spin up Claude MPM with specialized agents for:

Specification agent: Analyzes requirements and creates detailed specs
Architecture agent: Designs system structure and integration points
Implementation agent: Writes code following architectural guidelines
Testing agent: Creates test suites and validates functionality
Deployment agent: Handles local ops, QA checks, and rollout
Review agent: Evaluates overall work quality and assigns grades

That’s not aspirational. That’s what evaluated Codex-Max and gave it a B-.
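
To make the hand-off idea concrete, here is a minimal sketch of a sequential pipeline where each specialized agent adds to a shared record before passing it on. This is not Claude MPM’s actual API. The `WorkItem` type, the `make_agent` helper, and the stub behavior are all hypothetical stand-ins; a real agent would call a model and tools instead of appending a string.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class WorkItem:
    """Shared record that every agent reads and appends to (hypothetical)."""
    request: str
    notes: List[str] = field(default_factory=list)


Agent = Callable[[WorkItem], WorkItem]


def make_agent(role: str) -> Agent:
    """Build a stub agent; a real one would call an LLM and tools here."""
    def run(item: WorkItem) -> WorkItem:
        item.notes.append(f"{role}: handled '{item.request}'")
        return item
    return run


# Roles taken from the list above, in hand-off order.
ROLES = ["specification", "architecture", "implementation",
         "testing", "deployment", "review"]
PIPELINE: List[Tuple[str, Agent]] = [(role, make_agent(role)) for role in ROLES]


def run_pipeline(request: str) -> WorkItem:
    """Pass one work item through every agent in order."""
    item = WorkItem(request)
    for _role, agent in PIPELINE:
        item = agent(item)
    return item


if __name__ == "__main__":
    for note in run_pipeline("tool comparison view").notes:
        print(note)
```

The sketch runs the agents one after another for clarity. Parallel workstreams would need a scheduler and conflict handling on the shared record, which is where most of the real orchestration work lives.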

OpenAI optimized the wrong abstraction. They built a better single agent when the market needed orchestration infrastructure.

Market context: the window closed

The timing becomes even more painful when you look at the competitive landscape:

Claude Code Max ($200/month): 7-hour autonomous sessions, local-first execution, full MCP support for tool integration, sub-agent coordination

Augment Code ($50-100/month): Complete lifecycle automation, deployment integration, single-agent simplicity with full capability

Cursor ($20-200/month): Model-agnostic deep indexing, local execution, proven developer adoption

GitHub Copilot ($10-39/month): Ubiquitous IDE integration, massive installed base, continuous improvement

Codex-Max at $200/month (Pro tier) matches Claude Code Max’s price but not its capabilities. Its code generation is on par with tools costing a quarter of the price, yet it lacks their lifecycle features.

The positioning is stuck between market segments. Too expensive to be the “good enough” option. Not capable enough to be the “complete solution” option.

What this means for OpenAI

This isn’t just a product timing miss. It reveals a strategic challenge:

OpenAI treats AI coding tools as improved single agents. The market treats them as orchestration infrastructure for multi-agent workflows.

That’s a fundamental difference in how you approach product development:

Single-agent optimization focuses on:

Better code generation quality
Longer context windows
Faster execution
Smoother UX

Orchestration infrastructure focuses on:

Agent coordination protocols
Shared memory systems
Tool integration frameworks
Workflow automation patterns

OpenAI delivered the former. The market needs the latter.

The compaction technology is clever. The UX improvements are real. The coding quality gains matter. But they’re all optimizations of an architecture that became obsolete while OpenAI was building it.

The larger pattern

This isn’t unique to Codex-Max. I’m seeing the same pattern across the AI tools market:

Companies that win: Ship fast, iterate based on actual usage, adapt to market evolution

Companies that struggle: Build technically impressive solutions to problems that moved on

OpenAI’s challenges:

Long development cycles
Feature announcements without product availability
Building for what the market was, not what it’s becoming

The market’s evolution:

Quarterly shifts in competitive dynamics
Rapid adoption of new paradigms
Ruthless abandonment of tools that don’t keep pace

In fast-moving technology markets, timing isn’t just important - it’s definitional. A technically superior solution shipped nine months late loses to an adequate solution shipped on time.

Bottom line

OpenAI built a good product that arrived after the market moved on. Codex-Max would have been competitive in Q1 2025. It’s not competitive in Q4 2025.

The B- grade from Claude MPM captures this perfectly: Technically competent, meaningfully improved, fundamentally inadequate for current market needs.

For developers evaluating tools:

Skip Codex-Max if: You need deployment automation, local operations integration, QA workflows, or multi-agent orchestration. Use Claude Code Max or Augment Code instead.

Consider Codex-Max if: You want better code generation than Copilot, like cloud execution, primarily work with GitHub, and don’t need full lifecycle automation. At $20/month (Plus tier) it might make sense.

In AI development tools, market evolution outpaces product development cycles. Building the best version of yesterday’s architecture loses to shipping adequate versions of tomorrow’s architecture.

OpenAI solved Q1 2025’s problem in Q4 2025. That’s not fast enough.

I’m Bob Matsuoka, building multi-agent orchestration systems and writing about practical AI development at HyperDev. For more on AI coding tools in practice, read my analysis of Claude Code’s multi-agent capabilities or my comparison of orchestration frameworks in production.

I Hate PowerPoint Even More

Robert Matsuoka — Thu, 13 Nov 2025 15:03:30 GMT

Yes, I still hate PowerPoint.

Back in June, I wrote about how PowerPoint kills creativity through forced context-switching. My solution? Stay in Claude.AI’s artifact system, generate React slideshows, dodge the whole presentation software problem.

Then I actually used Gamma. For real work. Fifty-plus presentations since June.

And it’s... complicated.

Here’s what happened: I needed client decks fast. The Claude.AI artifact approach works great for internal stuff, but clients expect actual presentation files they can edit. So I started having Claude.AI generate Gamma-optimized prompts from my writing, then feeding those into Gamma’s AI.

The workflow stuck because it actually works.

The Workflow That Accidentally Solved My Problem

My process now looks like this: Write the core content in Claude.AI (where I’m already thinking), ask Claude to “generate a Gamma-optimized prompt from this,” paste that into Gamma, wait about 30 seconds.

What comes out? Surprisingly good first drafts.

Gamma’s gotten noticeably better at data layouts. Early on, every deck looked identical—same card structures, same visual hierarchy, same boring templates. Now there’s actual variety. Charts that make sense. Diagrams that don’t look like they came from a 2010 PowerPoint template pack.

The image generation is genuinely fast. Much faster than DALL-E 3, which matters when you’re iterating on client deliverables with tight deadlines. Quality’s solid for presentation purposes—not fine art, but better than stock photos.

Speed became the killer feature. From prompt to reviewable deck in under a minute. For comparison, building the same deck manually in PowerPoint? Easily an hour. Even my Claude artifact approach took 15-20 minutes of tweaking.

Gamma also has a new agent that solves the biggest problem with decks: one-way flow (the difficulty of changing content once it is in slide format). The agent is basically a format-aware LLM that can make prompt-driven changes ON THE FLY, respecting the structure of the deck. It’s magic. Try this in PowerPoint: “Let’s fix ‘We’ vs ‘I’. I am the author of the proposal, and should be shown for any action in the proposal. ‘We’ can be used to imply that implementing action is a team effort.”

What Still Sucks

Themes remain formulaic. You can spot a Gamma deck instantly. That “modern, clean, professional” aesthetic they’re known for? It’s also their limitation. Every deck feels like it came from the same design system because, well, it did.

Custom branding takes work. Not impossible, but friction exists. Clients with strict brand guidelines require manual adjustment of colors, fonts, layouts. The AI doesn’t learn your brand preferences across presentations.

Export compatibility has gotten better but still breaks occasionally. PowerPoint imports lose some formatting. Google Slides handles it reasonably well. PDF works perfectly, which is what most clients care about anyway.

The big context-switching problem from my original article? Still there. I’m still developing ideas in Claude.AI, then moving to Gamma to package them. Better packaging doesn’t solve the fundamental workflow break.

The $68M Series B Changes Nothing (And Everything)

Gamma just raised $68 million at a $2.1 billion valuation. Andreessen Horowitz led the round. The company hit $100 million ARR with only 50 employees and claims 70 million users.

Those numbers matter. Not because they make Gamma’s product better, but because they signal where the market’s going.

AI-powered presentation tools aren’t a feature anymore. They’re a category. The $2.5 billion subsegment of presentation software is growing at 25% annually. Traditional slide software (PowerPoint, Google Slides, Keynote) collectively represents $7.8 billion, but it’s legacy infrastructure now.

What’s interesting: Gamma reached profitability and stayed there for 15+ months before this raise. That’s rare in AI tools, where most companies burn through capital chasing growth. The economics actually work.

The funding goes toward addressing weaknesses—customer support infrastructure (Trustpilot shows a brutal 2.1/5 rating) and enterprise features. The $20 million secondary component gave early employees liquidity, which suggests the founders are playing the long game rather than rushing to exit.

Multi-Model Orchestration Actually Matters

Here’s the technical piece worth understanding: Gamma doesn’t use a single AI model. They orchestrate across multiple models—primarily Claude Sonnet and Gemini Flash, with OpenAI tested regularly but not favored.

This matters because different models excel at different tasks. Claude Sonnet handles complex content generation and structural thinking. Gemini Flash provides cost-efficient image generation and quick iterations. The company runs hundreds of A/B tests to optimize model selection per task.

When they tested Claude 3 Haiku, it delivered 30% better user satisfaction and 20% higher free-to-paid conversion. That’s not incremental—that’s the difference between sustainable business and burning capital.
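
A rough sketch of what per-task model selection and an A/B comparison could look like. Gamma’s real routing logic isn’t public, so everything here is an assumption: the route table, the model labels (plain strings, not real API model IDs), and the conversion rates in the usage example are all made up for illustration.

```python
# Hypothetical task-to-model routing table; labels are illustrative only.
ROUTES = {
    "content": "claude-sonnet",
    "image": "gemini-flash",
}


def pick_model(task_type: str, default: str = "claude-sonnet") -> str:
    """Return the model label for a task type, falling back to a default."""
    return ROUTES.get(task_type, default)


def relative_lift(control_rate: float, variant_rate: float) -> float:
    """Relative improvement of the variant over the control in an A/B test."""
    if control_rate <= 0:
        raise ValueError("control_rate must be positive")
    return (variant_rate - control_rate) / control_rate


if __name__ == "__main__":
    print(pick_model("image"))
    # Made-up rates: 5% free-to-paid conversion in control, 6% with the variant.
    print(f"{relative_lift(0.05, 0.06):.0%}")
```

The point of the design is that model choice becomes a lookup you can change after each experiment, rather than a decision baked into the product.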

The architecture uses block-based HTML instead of traditional slides. More responsive, more web-native, more flexible than the slide metaphor PowerPoint established 40 years ago. This is what I meant in my original article about dodging the medium that kills flow—Gamma at least tries to escape the rigid slide structure.

Competitive Reality: Nobody Else Gets Close

The competitors

I’ve tested the alternatives. Beautiful.ai focuses on design automation with limited AI content generation. Pricing starts at $12/month but no free tier, positioning it as premium design-first. Their “Smart Slides” auto-adjust layouts nicely, but you’re still building content manually.

Tome raised $81 million and hit 10 million signups fast, but the product feels unfinished. $16-20/month gets you narrative-focused presentations with PDF-only export. The dark-mode-default aesthetic works for pitch decks but fails for client-facing business presentations. Limited to 400-character prompts, so complex decks require extensive manual work.

Canva offers Magic Design for Presentations as one feature among dozens. The $10/month pricing undercuts everyone, but presentations aren’t the focus. Export compatibility to PowerPoint breaks constantly. Format preservation? Forget it.

Microsoft Copilot costs $66-87/month minimum (Copilot plus Microsoft 365 E3/E5 licensing). Enterprise security and Microsoft Graph integration appeal to large organizations, but the economics are brutal for individuals or small teams. Plus, no transitions, videos, or tables in AI-generated slides initially. English-only at launch. Accuracy requires manual review.

Pitch positions itself as collaboration-first with basic AI (“Start with AI” limited to 400 characters). $20/month per user, $80/month for business tier. Excels at real-time collaboration and analytics, but the AI feels like an afterthought.

Gamma sits in the sweet spot: $8-15/month with a genuinely useful free tier (400 one-time AI credits). Better AI content generation than anyone except Microsoft, but without enterprise complexity or costs. Multi-model orchestration surpasses all competitors except Microsoft, and Gamma’s web-native format differentiates it from traditional slides.

A Concerning Trust Problem

Here’s what concerns me: Gamma’s Trustpilot rating of 2.1/5 across 39+ reviews represents a reputational risk for a B2B tool claiming 70 million users.

Common complaints? Customer support ignoring refund requests for weeks, AI-generated automated responses with no human follow-up, image quality described as “chaotic and glitchy” for professional use, promised premium features not delivered after payment.

The volume of negative reviews remains small relative to the claimed user base, suggesting either sampling bias toward extremely dissatisfied customers or concerning patterns in user experience that could block enterprise adoption.

For context: Product Hunt rates Gamma 4.6-4.9/5 across 170+ reviews. G2 shows 4.3/5. Microsoft Store rates 4.2-4.3/5. The polarization between professional users (who love it) and dissatisfied customers (who hate customer support) creates risk.

The Series B funding presumably addresses this—50 employees can’t support 70 million users effectively. Scaling customer support infrastructure takes capital and systems, both of which this round should provide.

My Original Critique Stands (Mostly)

Go back to “I Hate PowerPoint.” My argument: any context switch from brainstorming to presenting breaks creative momentum. The medium forces you to chunk ideas into slide formats before the ideas are fully formed.

Gamma doesn’t solve this. You still develop ideas in one place (Claude.AI, docs, notes), then move to Gamma to package them. It’s a better publishing tool, not a thinking tool.

But—and this matters—the workflow I’ve developed minimizes the friction. Claude.AI generates the Gamma prompt from my raw thinking. Gamma builds the deck structure. I review and refine in Gamma rather than rebuilding from scratch.

That’s not elimination of context-switching. It’s reduction of friction within the switch.

The speed advantage matters more than I expected. When you can go from raw ideas to reviewable deck in under two minutes, the context switch feels less disruptive. Not zero, but manageable.

For client work specifically, Gamma solves a real problem: clients expect presentation files, not React component artifacts. The output needs to be editable by non-technical people using familiar tools. Gamma delivers that in a format that imports reasonably well into PowerPoint and Google Slides.

What This Says About AI Presentation Tools

The market validated my “I Hate PowerPoint” thesis in an unexpected way. The problem isn’t PowerPoint itself—it’s the forced chunking of ideas into presentation formats before ideas are fully developed.

AI tools like Gamma reduce the friction of that chunking. They don’t eliminate it.

The $68 million Series B at $2.1 billion valuation signals that investors believe AI-powered presentation generation is a durable category, not a temporary feature. The projected 25% CAGR for AI-specific presentation software through 2030 suggests the market expands rather than substitutes.

But here’s what keeps me skeptical: conversational AI interfaces could eliminate standalone presentation tools entirely. Anthropic’s Claude or OpenAI’s ChatGPT could embed presentation generation natively, competing with dedicated tools. Why leave your conversation to generate a deck when the conversation can generate the deck without context-switching?

Gamma’s current advantage—multi-model orchestration, prompt engineering culture, extensive A/B testing—represents a technical moat. But moats erode when platform companies decide to compete directly.

Bottom Line

I’ve created 50+ presentations in Gamma since June. It works. Fast, reasonably good output, improving steadily.

Does it solve the PowerPoint problem? No. It makes the problem more tolerable.

The workflow I’ve developed—Claude.AI for thinking, Claude-generated prompts for Gamma, Gamma for packaging—reduces friction without eliminating the fundamental context switch.

For client-facing work where deliverables need to be presentation files, Gamma beats the alternatives. For internal work or pure creative thinking, my Claude artifact approach still feels cleaner.

The $68 million raise and $2.1 billion valuation validate that AI presentation tools represent a real category with sustainable economics. Gamma’s capital efficiency (reaching this scale on $90 million total raised) and profitability distinguish it from cash-burning competitors.

But the ultimate solution to “I Hate PowerPoint” isn’t better presentation software. It’s eliminating the need to translate ideas into presentation formats at all. When that happens—when conversational AI can deliver the final format without leaving the conversation—tools like Gamma become legacy infrastructure.

For now? Gamma’s the best execution of a flawed category. Which means I’ll keep using it while hoping something better comes along.

Related: I Hate PowerPoint - my original argument about why presentation software kills creative flow, and my follow-up, where I first talked about Gamma.

Goose: An Open Source Take on Vibe Coding and Agentic Workflow Automation

Robert Matsuoka — Thu, 30 Oct 2025 14:05:19 GMT

The quickest gains in AI productivity don’t come from replacing senior developers doing complex architecture. They come from automating workflow tasks that slow everything else down—migrations, test generation, documentation updates. Tactical work that needs doing but doesn’t need deep thinking.

I’ve been using Warp for exactly this workflow automation. It’s been effective for my own day-to-day needs. But after my friend Chris Bunk recently introduced me to Goose—he’s another technical leader who’s practicing agentic development—I’m realizing it might be better suited for this specific use case. Not because Warp fails at workflow automation (it doesn’t), but because Goose was built specifically for it without the pressure to become a general-purpose platform.

Goose fits a particular niche—a framework purpose-built for workflow automation without the VC pressure to become everything to everyone.

Block released it in January 2025 with zero monetization strategy. No freemium tier. No enterprise upsell. No growth targets. Just “here’s a thing we built for workflow automation, maybe it’ll work for you too.”

What vibe coding actually means

When Andrej Karpathy tweeted “vibe coding” on February 2, 2025, he captured something real: the difference between building production systems and spinning up tactical solutions. Prototypes. Scripts. Tools that solve immediate problems. The kind of work where you want AI to handle execution while you focus on the actual problem.

Goose landed in this space naturally. Block’s engineers use it for 30-60 minute tasks—code migrations, generating tests, building applications from Figma designs. Not “let’s architect a new service” work. The “let’s just get this done” work that delivers immediate productivity gains. They’re handling infrastructure work, test coverage improvements, documentation updates. The stuff that needs doing but doesn’t need architectural thinking.

Block’s open source bet

Block released Goose with zero monetization strategy. No freemium funnel. No enterprise tier. No conversion targets. The architecture is straightforward: CLI interface + Rust-based desktop app, core execution engine that interprets requests and coordinates tasks, and MCP server extensions. You describe what you want (“migrate this component” or “increase test coverage above 80%”) and it handles execution.

The MCP integration provides connections to hundreds of pre-built servers for GitHub, Jira, Slack, Google Drive, databases, browsers. Not because Block needed that breadth, but because the community built it.

Block CTO Dhanji Prasanna positions it as strategic open source. The Apache 2.0 license means full permissiveness for enterprise customization, modification, whatever. No strings.

The timing mattered. MCP emerged as a potential industry standard, and Block collaborated with Anthropic on protocol development. When OpenAI adopted MCP in March 2025 across ChatGPT and their Agents SDK, followed by Microsoft’s integration through GitHub Copilot Studio and Azure AI, that early positioning paid off.

Financial backing comes from Block’s operating budget. No VC pressure to hit ARR milestones or monetize the user base. The open-source strategy also serves Block’s hiring needs—engineers care about working at companies that contribute meaningfully to open source. Hacktoberfest 2025 participation, grant programs for external developers, active Discord community.

The project picked up significant traction—over 20,000 GitHub stars in nine months. For context, that puts it in the top tier of AI coding tools by community engagement, though still behind established players like Continue (48K stars) or Cursor (25K). The adoption came through engineers finding something that works and telling other engineers about it, not through marketing campaigns.

What it actually does

For vibe coding and workflow automation, Goose delivers autonomous task execution with LLM flexibility. It interprets multi-step requests, generates code across files, executes commands to test implementations, captures and debugs errors, iterates until completion.
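
The generate, execute, capture errors, iterate cycle is easy to sketch. This is not Goose’s implementation. `run_until_passing`, `fake_generate`, and `check_add` are hypothetical names, and the “model” is a stub that returns a buggy draft first and a fixed one once it sees feedback.

```python
from typing import Callable, Tuple


def run_until_passing(
    generate: Callable[[str], str],
    check: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 5,
) -> Tuple[str, int]:
    """Generate code, test it, and feed failures back until it passes."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        ok, feedback = check(candidate)
        if ok:
            return candidate, attempt
    raise RuntimeError(f"no passing candidate after {max_attempts} attempts")


def fake_generate(feedback: str) -> str:
    """Stand-in for an LLM call: wrong on the first try, fixed on retry."""
    if not feedback:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def check_add(source: str) -> Tuple[bool, str]:
    """Execute the candidate and report a failure message if it is wrong."""
    namespace: dict = {}
    exec(source, namespace)  # acceptable for a toy stub; sandbox real output
    result = namespace["add"](2, 3)
    if result == 5:
        return True, ""
    return False, f"add(2, 3) returned {result}, expected 5"


if __name__ == "__main__":
    code, attempts = run_until_passing(fake_generate, check_add)
    print(f"passed after {attempts} attempts")
```

The attempt cap matters in practice: without it, a task the model can’t solve burns API spend indefinitely.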

LLM support includes Claude (3.5 Sonnet/Haiku), all OpenAI models, Gemini, DeepSeek, local models via Ollama, 20+ providers through the Tetrate router. You’re not locked into any single vendor. For local development with privacy concerns, Ollama integration matters.

The MCP ecosystem bet matters more than current features. As OpenAI, Microsoft, Google DeepMind, Replit and Sourcegraph adopt MCP, the protocol becomes infrastructure. Community-built MCP servers for Stripe, Postgres, or internal systems work with minimal friction across Goose, OpenAI’s agents, any MCP-compatible tool.

Goose vs Warp: Different models, different results

Goose charges nothing for software. You pay LLM API costs—$100-300/month for heavy development usage. Compare that to Warp, which raised $73 million and prices at $15/month Pro, $40/month Turbo, custom Enterprise.

The meaningful difference isn’t features—it’s architecture philosophy. Warp’s multi-threading runs multiple agents simultaneously with a management interface. A consulting firm using Warp’s multi-agent capabilities documented substantial productivity improvements in their case study, though specific figures should be viewed as user-reported results rather than controlled benchmarks. Goose can run multiple instances in different terminal windows, but coordination is manual. Works fine for vibe coding where you’re spinning up quick tasks. Less useful for complex orchestration.

Warp’s $73 million war chest enables engineering velocity that community projects can’t match. They employ approximately 69 people building features, refining UX, closing enterprise deals. Multi-threaded agent execution with orchestration UI, status tracking, completion notifications provides reported time savings for heavy users according to their customer data. For complex orchestration where you’re managing five different workstreams simultaneously, this matters.

Terminal-Bench performance ranks Warp #4 with 52% success rate. The product experience shows notable consumer-grade polish—the kind of attention to detail that reduces adoption friction substantially. Platform integration creates network effects through Warp Drive’s team knowledge base that stores runbooks, workflows, environment variables that agents automatically access.

Goose’s local-first architecture provides different advantages. It runs entirely on-machine by default with zero data transmission except user-chosen LLM API calls. No telemetry. No analytics. No cloud dependencies beyond optional MCP connections. For financial services, healthcare, government agencies, this enables deployment in air-gapped environments. Security teams can review source code line-by-line, modify for internal requirements, deploy in classified environments.

Cost transparency eliminates subscription fatigue: $0 software + ~$100-300/month in API calls for heavy use. A 500-developer shop pays $90,000-240,000 annually for Warp Pro or Turbo. Goose costs $0 for software plus $120,000-300,000 in LLM expenses, which works out to roughly $20-50 per developer per month, well below the heavy-usage figure. That is potentially cheaper, though it requires internal IT support.
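The figures in this comparison are straightforward to reproduce. The sketch below uses only the prices quoted in this review (the function and variable names are mine) and also backs out the per-developer monthly spend that the Goose total implies, which is worth comparing against the heavy-usage API range.

```python
MONTHS = 12

def annual_cost(developers: int, monthly_per_dev: float) -> float:
    """Total yearly spend for a flat per-developer monthly rate."""
    return developers * monthly_per_dev * MONTHS

def implied_monthly_per_dev(annual_total: float, developers: int) -> float:
    """Per-developer monthly spend implied by an annual total."""
    return annual_total / developers / MONTHS

team = 500
warp_pro = annual_cost(team, 15)       # 90000
warp_turbo = annual_cost(team, 40)     # 240000

# What the quoted Goose totals imply per developer per month.
goose_low = implied_monthly_per_dev(120_000, team)    # 20.0
goose_high = implied_monthly_per_dev(300_000, team)   # 50.0

# What the same team would spend if everyone hit the heavy-usage range.
heavy_low = annual_cost(team, 100)     # 600000
heavy_high = annual_cost(team, 300)    # 1800000

print(warp_pro, warp_turbo, goose_low, goose_high, heavy_low, heavy_high)
```

The comparison is sensitive to how heavily each developer actually uses the API, so it is worth rerunning with your own usage numbers before drawing conclusions.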

Both models work. Commercial tools deliver polish and multi-agent orchestration. Community tools deliver transparency and focused workflow automation. Different approaches for different workflow needs.

Where productivity gains actually happen

Test generation? Comprehensive test suites in 20 minutes instead of two days. Documentation updates? Feed Goose your codebase, let it generate accurate docs while you solve actual problems. Dependency migrations? The tedious-but-straightforward work that burns hours—Goose handles it autonomously.

Community response sits around 6.5/10 on enthusiasm—genuine interest without “this changes everything” hype. Engineers appreciate specific solved problems. One developer: “I created a custom CLI command... I don’t know Go that well... Goose did it all for me in ~30 minutes.” Another: “My sister had been asking me for months to help her build a Google Docs extension but I kept putting it off. Today, we built one in just 30 minutes with Goose.”

Adoption patterns show 70% real usage versus 30% curiosity. External validation includes Databricks featuring Goose at their Data + AI Summit 2025, hundreds of community-built MCP extensions.

So what does this tell us?

Goose validates targeted application for workflow automation with clear use cases. Block’s engineers using it weekly for specific infrastructure tasks, migrations, and test coverage work prove the model works—they’re automating tactical work that slows everything else down, not trying to replace core product development with AI.

What Goose shows is what tools look like when commercial pressure doesn’t force them to promise everything. Warp needs to justify $73 million in funding with subscription revenue and enterprise deals. Every product decision gets filtered through monetization strategy. Goose has none of that pressure. Block builds what their engineers need. The community builds what the community wants.

For developers choosing tools, understand what you’re optimizing for. Workflow automation where AI delivers real gains? Goose excels—local-first, cost transparent, focused on the tasks that deliver value. Complex orchestration requiring parallel agent coordination? Warp’s multi-threading provides documented productivity improvements.

Most teams end up using both for different workflow contexts. Workflow automation is where AI coding productivity gains actually happen—and focused tools built without commercial pressure often deliver that value better than platforms trying to solve every problem.

Claude.AI Gets a Tool Facelift

Robert Matsuoka — Sat, 13 Sep 2025 14:10:28 GMT

The Claude.AI editor drove me nuts for months. Not the kind of "minor annoyance" nuts—the "throw my laptop out the window" variety.

You'd think a company building AI assistants could figure out document editing. Wrong. The editor would lose track of changes, ignore edit requests, or just... forget what we were working on mid-conversation. Classic case of brilliant AI trapped in amateur tooling.

That changed this week. Quietly, without fanfare, Claude.AI shipped what looks like a complete editor overhaul. Tool calls now bundle into threads. The interface feels snappier. Most importantly? It actually remembers what we were doing (so far so good). It also doesn't "re-type" the entire artifact to show a change; it only updates the specific section you changed, which is part of why it feels faster. I still think we need direct editing (hello GPT Canvas!).

What Was Broken (And Why It Mattered)

The old Claude.AI editor followed what I call the "nuclear option" approach: send everything to the LLM every single time. Want to fix a typo? Upload the entire document. Need to adjust one paragraph? Here's 50 pages of context the AI doesn't need.

This created two massive problems:

Direct editing was impossible. You couldn't just change a sentence—you had to describe the change and hope Claude interpreted your request correctly. Like dictating edits to someone who might not speak your language. This is still broken, IMO, but better.

Context bleeding everywhere. The AI would see every word of your document on every edit, creating bizarre behaviors where it would "fix" things you didn't ask it to touch. Change a title? Suddenly it's rewriting your conclusion.

Worse, the editor kept losing track of incremental changes. I'd request three small edits, and Claude would make two of them, ignore the third, then act like everything was fine. When you pointed out the missing change, it would either claim it already made it or suggest starting over.

Professional writing workflows don't work this way. You need precise control, reliable execution, surgical changes. The old editor felt like performing surgery with a sledgehammer.
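The difference between the two strategies is easy to see in miniature. This is an illustration of the idea only, not how Claude.AI is implemented: the document is modeled as a list of sections, and the `Edit` type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Edit:
    section: int      # index of the section to change
    instruction: str  # what the user asked for

def full_resend_payload(doc: List[str], edit: Edit) -> str:
    """'Nuclear option': the whole document travels with every request."""
    return edit.instruction + "\n\n" + "\n\n".join(doc)

def targeted_payload(doc: List[str], edit: Edit) -> str:
    """Targeted edit: only the section being changed is sent."""
    return edit.instruction + "\n\n" + doc[edit.section]

def apply_targeted(doc: List[str], edit: Edit, new_text: str) -> List[str]:
    """Replace one section; all other sections are untouched by construction."""
    updated = list(doc)
    updated[edit.section] = new_text
    return updated

doc = ["Title: Quarterly Report", "Body paragraph. " * 200, "Conclusion: ship it."]
edit = Edit(section=0, instruction="Rename the title.")

full_size = len(full_resend_payload(doc, edit))
targeted_size = len(targeted_payload(doc, edit))
new_doc = apply_targeted(doc, edit, "Title: Quarterly Report (Q3)")

print(full_size, targeted_size)
```

With a targeted edit, the model never sees the conclusion while changing the title, so it has no opportunity to "fix" it.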

The Threading Revolution

The new system bundles tool calls into conversation threads instead of treating each edit as an isolated event. Sounds simple. Changes everything.

Now when I request multiple document changes, they flow together as a coherent conversation rather than disconnected commands. The AI maintains context about what we're working on, remembers the sequence of changes, builds on previous edits rather than treating each one as starting from scratch.

Example: Yesterday I asked Claude to:

Restructure the introduction
Add more specific examples in section three
Tighten the conclusion

Under the old system, this would have been three separate "send the entire document" operations. The AI might have forgotten the introduction changes by the time it reached the conclusion. Or applied contradictory formatting across sections.

The new threading approach handled all three requests as parts of a single editing session. Context carried forward. Changes built on each other. No weird inconsistencies or forgotten requests.

Visual Features That Actually Help

The interface updates aren't just cosmetic. The new visualization features make the editing workflow more transparent—you can see what Claude is thinking about, which sections it's focusing on, how changes connect to each other.

This matters because AI editing requires trust. When the system ignores a request or makes unexpected changes, you need to understand why. The old editor was a black box. Make a request, cross your fingers, hope for the best.

The updated interface shows you what's happening. Not full transparency—it's still an AI making decisions—but enough visibility to build confidence in the process.

What Still Needs Work

The editor improvements are significant, but this isn't a complete transformation. Some fundamental limitations remain:

The "send everything" model persists. Threading makes it smarter, but Claude still processes entire documents rather than making targeted edits. This creates latency and cost issues for larger documents.

Complex formatting gets messy. Rich text, nested lists, image placements—anything beyond basic text structure can confuse the AI. It's better at understanding what you want but still struggles with precise formatting control.

Version control is nonexistent. No branching, no revision history, no way to compare document states. If Claude makes changes you don't like, your only option is asking it to undo them or starting over.

The Bigger Picture

These editor improvements signal something important about AI tool development. The companies succeeding long-term aren't just building better models—they're building better interfaces for working with those models.

Claude's AI capabilities have been strong for months. What held back adoption was the frustrating experience of actually using those capabilities. Fix the interface, and suddenly the AI becomes more valuable.

This pattern repeats across AI tools. GitHub Copilot's success isn't just about code generation—it's about seamless IDE integration. Cursor works because it handles the editing workflow thoughtfully, not just the AI suggestions.

The Claude.AI editor still isn't perfect. But it's finally usable in professional workflows. That's the threshold that matters—not perfect AI, but good enough AI wrapped in tools that actually work.

Should You Try It?

If you've avoided Claude.AI's editor because of past frustrations, it's worth another look. The threading system alone makes document collaboration significantly more reliable.

For professional writing workflows, it's particularly valuable for:

Draft restructuring - The AI maintains better context across large changes
Content iteration - Multiple rounds of edits build coherently rather than conflicting
Collaborative editing - When working with AI on complex documents that need many revisions

The improved editor doesn't change Claude's core AI capabilities, but it makes them accessible in ways that weren't possible before. Sometimes the interface matters more than the intelligence behind it.

Good timing too. As AI writing tools proliferate, the ones that survive will be those that respect how humans actually work—not the ones that force us to adapt to their limitations.

The new Claude.AI editor feels like a step in the right direction. About time.

Augment Code's Auggie: When Focused Single-Process Beats Multi-Agent

Robert Matsuoka — Thu, 04 Sep 2025 17:00:51 GMT

It’s no secret that I've been a fan of Augment Code since March. When I'm not using Claude Code, they're my go-to backup—consistently getting better and notably faster over the past few months. Their advanced context management and code indexing are impressive achievements. So when they announced Auggie, their new CLI tool, I was intrigued.

I've written before about the limitations of working inside VS Code. CLI-based models for agentic coding just work better for me. So on a train ride up to San Francisco today, I decided to give Auggie a real test on a problem that had been bothering me for days.

The Problem That Wouldn't Die

My Claude MPM framework has this 3D graph visualization feature—an AST-based file explorer using D3.js to render code relationships. Claude Code couldn't crack it across multiple sessions. The complexity of coordinating D3.js rendering with file system analysis kept leading us in circles.

Within 45 minutes on that train ride, Auggie and I had it working.

What Makes Auggie Different

Let me be clear upfront: Auggie doesn't support sub-processes or sub-agents like Claude Code does. It's a single-process tool, full stop. No spawning specialized agents for different tasks, no parallel exploration of multiple solution paths.

But here's what struck me: sometimes you don't need an army. Sometimes you need one really good soldier who understands the terrain.

Auggie's Context Engine automatically indexed my entire MPM codebase—all the clues I'd left in various files, the CLAUDE.md instructions, the interconnected module relationships. I fired it up and Auggie just... understood. The tool said at one point, "Now I can see the pattern," when working through the D3.js integration. Pretty cool.

This contextual superiority aligns with what I've seen in my broader tool comparisons. Augment Code consistently excels at whole-codebase understanding.

The Claude Code Influence (In a Good Way)

Augment Code clearly studied Claude Code's interface, and I mean that as sincere flattery. The terminal UI looks remarkably similar—so much so that muscle memory transfers immediately. They've even implemented slash commands that Claude Code users will recognize:

/model for switching between models mid-session
/task for opening the task manager
/github-workflow for CI/CD integration
Custom commands via auggie command

The three operating modes feel natural:

Interactive mode: Full terminal UI with streaming responses
Print mode: auggie --print "instruction" for single commands
Quiet mode: Returns only final output, perfect for piping
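Print mode is what makes Auggie scriptable. Here is a minimal sketch of wrapping it from Python, assuming `auggie` is on your PATH and writes its answer to stdout. Only the `--print` form comes from this review; the quiet-mode flag is not shown here, so the sketch does not guess at it.

```python
import shutil
import subprocess
from typing import List, Optional

def build_print_command(instruction: str, binary: str = "auggie") -> List[str]:
    """Argument list for a one-shot run: auggie --print "instruction"."""
    return [binary, "--print", instruction]

def run_once(instruction: str) -> Optional[str]:
    """Run a single instruction and return stdout, or None if auggie is not installed."""
    cmd = build_print_command(instruction)
    if shutil.which(cmd[0]) is None:
        return None
    # Assumption: the tool prints its response to stdout and then exits.
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout

command = build_print_command("summarize README.md")
print(command)
```

Passing the instruction as a single list element avoids shell quoting problems when the prompt contains spaces or quotes.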

Where It Shines

Speed and responsiveness. This thing is fast. Noticeably faster than Claude Code on similar tasks, and much smoother than Codex CLI or Gemini CLI. Response streaming feels instantaneous. They’re using the same models, but CC adds processing overhead in its hugely complex client, which Auggie avoids.

Context understanding. The automatic codebase indexing is remarkable. I didn't have to explain my project structure or manually include files. It discovered the relationships between modules, understood my documentation patterns, and applied that knowledge to solve the D3.js problem.

Focus. Without sub-agent capabilities, Auggie forces you to think differently. Instead of orchestrating multiple agents, you're having a focused conversation with one highly capable assistant. For the D3.js debugging session, this constraint actually helped: we stayed on target instead of exploring tangential paths.

Input validation against project context. Here's something unexpected: I accidentally pasted a prompt from a different project during one session, and Auggie caught it immediately. "That doesn't seem related to your current codebase," it said, asking for clarification. Claude Code would typically just run with it, trying to solve whatever you throw at it regardless of relevance. This kind of sanity checking shows Auggie isn't just indexing your code—it's actually understanding your project's boundaries and purpose.

The Early-Stage Reality

Let's talk about what's missing, because these limitations matter.

No image support. You can't paste screenshots or diagrams into Auggie. For visual debugging or UI work, this is a real problem. I couldn't show it what the broken D3.js rendering looked like.

Large text blocks fail. Pasting substantial code snippets or logs often breaks the UI. The input handling isn't as robust yet. I had to work around this by referencing files rather than pasting content directly.

No sub-agents means no delegation. While focused single-process work has advantages, complex projects that benefit from parallel exploration hit a wall. Claude Code can spawn specialized agents for testing while another handles implementation. Auggie can't.

Diff visualization is rough. Multiple users cite this as a dealbreaker, and I understand why. The diff rendering is primitive compared to Claude Code's mature implementation. You're often accepting changes on faith rather than clear visual confirmation.

The Remote Agent Question

Augment offers Remote Agents as a cloud-based feature—up to 10 parallel agents working 24/7 on your codebase. But these aren't integrated with Auggie CLI yet. They're accessible through VS Code, with CLI integration "on the roadmap."

This feels like a missed opportunity. If Auggie could trigger Remote Agents from the CLI, even without native sub-process support, it would address the parallel processing limitation.

The Verdict: Right Tool, Right Job

After that train ride success, I'll continue using Auggie for specific types of problems. When I need focused, deep work on a complex technical issue—especially one requiring comprehensive codebase understanding—Auggie excels.

For single-problem debugging sessions where context matters more than parallelization, Auggie often outperforms Claude Code. The D3.js issue that stumped multiple Claude MPM sessions fell to Auggie's superior context grasp in under an hour.

But Claude MPM remains superior for orchestrating complex, multi-faceted development where you want specialized agents handling QA, documentation, and implementation simultaneously. Different tools for different jobs.

Should You Try It?

If you're hitting context limitations with Claude Code, absolutely. The $50/month developer tier gets you about 600 messages—expensive for individual developers, but the Context Engine might justify the cost if you work with large codebases.

If you need multi-agent orchestration, stick with Claude Code or wait for Auggie's Remote Agent CLI integration (and frankly, I suspect Augment Code will be adding multi-agent support in the relatively near future).

If you're frustrated with token management and context windows, Auggie's automatic indexing feels like magic.

Bottom Line

Auggie CLI is an impressive early effort that gets some fundamental things very right. The Context Engine's automatic codebase understanding is genuinely superior to anything else I've used. The single-process limitation forces focused problem-solving that can be surprisingly effective.

But it's still early days. The missing image support, text input limitations, and primitive diff visualization create real friction. The lack of sub-agent support isn't just a feature gap—it's a philosophical difference that limits certain workflows.

I'm keeping my Augment Code subscription. For focused, context-heavy debugging sessions, Auggie has already proven its worth. That D3.js problem would probably still be unsolved without it.

I’m not wondering whether Auggie can replace Claude Code—it can't, and that's not the goal. The question is whether its superior context understanding justifies adding this tool to your arsenal.

For me, solving that persistent D3.js issue in 45 minutes on a train answered that question.

Have you tried Auggie CLI? I'm curious how others are working around the sub-agent limitation. Hit reply and let me know if you've found creative solutions.

Related Reading:

How I Trained Augment to Run Git and GitHub - Teaching AI proper Git workflows
Why Augment Code Is Betting Beyond the IDE - Deep dive on Remote Agents and the future direction

I tried Google Opal so you don't have to

Robert Matsuoka — Fri, 01 Aug 2025 14:03:32 GMT

When Google announced Opal, their new vibe coder entry, I had to try it. The space currently has Bolt, Lovable, Replit, and V0. Lovable leads the pack with Bolt gaining ground thanks to its Figma partnership. V0 does interesting work with API-driven vibe coding and design system integration.

So what's Opal's angle?

I'm not sure, actually. It looks more like a no-code tool than a proper coder. Maybe that's the point—point-and-click coding. I gave it a detailed prompt to build a travel intelligence platform requiring vector-based semantic search, database integration for knowledge storage, real-time data collection from multiple APIs, and export functionality including PDF generation. The system needed to handle user profiles, routing optimization, and offline capabilities.

Opal created a basic visual workflow with input nodes for "destination" and output nodes displaying text suggestions. No database schema, no API integrations, no file export capabilities. When I asked it to use form controls instead of text prompts for user input, it ignored the request entirely.

What Opal actually builds

Google's Opal creates visual workflow diagrams rather than actual code files. Users describe their app in natural language, and Opal creates a node-based flowchart showing inputs, AI model calls, and outputs.

You never see traditional code in JavaScript, Python, or React. You can edit prompts within each node, but the underlying execution stays hidden. This positions Opal closer to automation tools like Zapier than to legitimate code generation platforms.

The workflow consists of input nodes for data entry, processing nodes for AI transformations, and output nodes for displaying results. Unlike competitors that produce downloadable, deployable code, Opal apps remain locked within Google's ecosystem with no export capabilities.

Here's what struck me: this misses recent advances in deployable-code workflows that the market has embraced. Every serious vibe coder now gives you actual code you can deploy, modify, and own.

The competitive landscape Google entered

According to reports from tech publications tracking the space, Lovable raised $7.5M in Series A funding and their GitHub repository shows substantial community adoption. They generate production-ready React/TypeScript code with full GitHub integration. Their "Select to Edit" feature lets you directly manipulate components, and Supabase integration provides authentication, databases, and edge functions.

Bolt formed a strategic partnership with Figma, allowing users to transform Figma designs into production apps with one click. This integration, powered by Anima, strengthens the design-to-development workflow without requiring acquisition costs.

V0 by Vercel has carved out a niche as a UI specialist, releasing its "v0-1.0-md" model via API for premium subscribers at $20-30/month. They generate actual React components with proper TypeScript definitions.

Replit Agent offers browser-based development across 50+ languages, leveraging bounty service data where users describe projects in natural language. Their $20/month Core plan includes deployment infrastructure.

Notice the pattern? They all produce actual code. Files you can download, modify, deploy, and scale.

What developers actually said about Opal

One Medium author built an AI calisthenics assistant in 10 minutes, praising the "frictionless sharing" but acknowledging limitations for complex applications.

A Hacker News user encountered what they described as "possibly the ugliest error I have ever seen" when attempting to build a complex community gathering app. While this represents one user's experience, it illustrates the challenges Opal faces when handling multi-faceted requirements beyond simple, linear workflows.

Developer blog posts consistently describe it as "Canva, but for mini-apps." That's misaligned with what many developers prioritize in production tools.

The pattern in early feedback suggests Opal serves as a tool for rapid prototyping and idea validation rather than serious development. Success stories focus on content creation tools, productivity apps, and educational helpers—all front-end focused applications without backend complexity.

Technical gaps that matter

The technical comparison reveals fundamental problems. Opal offers no API generation, database integration, server-side logic, or authentication systems. Compare this to Lovable's native Supabase integration, Bolt's Node.js server support, or Replit's multi-language API capabilities.

Opal apps remain trapped on Google's servers with link-based sharing only. No custom domains. No GitHub integration. No export options.

This isn't just a feature gap—it suggests a different understanding of developer priorities. Developers need more than visual automation—they need control, scalability, and code portability.

My assessment: Wrong market positioning

Google Opal misses advances that define the current vibe coding movement. We're not looking for visual workflow builders. We want tools that enhance our development capabilities while maintaining code quality and control.

Industry reports estimate the low-code/no-code market at various sizes, but the value in vibe coding specifically comes from tools that generate real, deployable code. Opal's visual workflow approach exists in automation tools and adds limited value beyond integrating Google's AI models into a flowchart interface.

For complete beginners and hobbyists, Opal might find a niche. It competes more with traditional no-code tools than development platforms. But without backend capabilities, code export, or production scalability, it suggests a different understanding of where this market has moved.

It's Google, so don't count them out entirely. But personally, I found it underwhelming on key developer criteria. If you want a flexible no/low-code tool for simple workflows, this might work for you. If you want to generate actual code you can deploy and scale, stick with the established players.

The industry has moved beyond visual builders toward real code generation. Opal misses this shift and steps into a problem we've already solved better elsewhere.

The Opal UX. With a random input box I couldn’t figure out how to delete…

Someone Will Find A Use For This Kind Of App (it took over 20 minutes to collect this information. Not optimal)

Warp Code - When Your Terminal Gets Ambitious

Robert Matsuoka — Fri, 27 Jun 2025 14:02:38 GMT

The Setup

I've been using Warp as my primary terminal for some time now. It's a good addition to anyone's desktop—fast, modern, with features that improve daily workflow (no more looking up command parameters - ever!). So when they announced Warp Code, their AI coding assistant, I was curious but also cautious. Terminal companies expanding into AI development tools? We've seen this story before.

But here's the thing: they got some basics right immediately.

What Works

The LLM access is solid. Claude Sonnet 4 and GPT-4o as options—no proprietary models, no compromises. You can switch between models per conversation, though I found myself sticking with Claude for most coding tasks. When I'm debugging TypeScript compilation errors or wrestling with Chrome extension manifests, I want reliable reasoning. Warp Code delivers that.

The integration feels natural within Warp's existing environment. Unlike bolt-on AI features that feel foreign, this builds on Warp's command suggestions and workflow integration. When you're deep in a debugging session, asking the AI doesn't break your flow.

It handles straightforward coding questions competently. During my testing session building a Chrome extension, Warp Code correctly explained manifest syntax requirements and suggested proper TypeScript configurations. For standard debugging workflows—understanding error messages, suggesting configuration fixes—it performs as expected.

The pricing sits at $20/month with an annual discount for basic access. There's also a $200/month tier for heavier usage, though most developers won't need it for occasional AI assistance.

The Fundamental Problem

Here's where things get interesting—and problematic.

Modality confusion is real. You're coding in an IDE or editor, but the AI assistant lives in your terminal. This creates constant context switching that breaks flow. When Warp Code suggests file changes, I still need to manually navigate to my editor, find the file, and implement the changes. It's functional, but it's not fluid.

The terminal isn't built for code review. Warp Code can generate solutions and debug problems, but reviewing complex outputs feels cramped in a terminal interface. Compare this to Zed, which embeds rich data viewers—interactive trees, formatted JSON, side-by-side diffs—directly in the editing environment. Warp Code shows you text. That's it.

Context handling has limits. The AI remembers our conversation within a session, but start a new terminal window and you're starting fresh. For ongoing development work, this means repeatedly re-establishing project context.

The Bigger Question

Testing Warp Code surfaced something I've been thinking about across all these AI coding tools: What differentiates them when they're all using the same frontier models?

Claude Sonnet, GPT-4o, the emerging reasoning models—the core AI capabilities are increasingly similar. The prompting strategies are converging. The basic workflows (chat, generate, iterate) are nearly identical.

The real differentiators are becoming infrastructure:

Tree-sitter parsers for code understanding
Semantic indexing for large codebases
Build system integration
Context management across sessions
UI/UX for complex data review
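To give a feel for what the first two items involve, here is a toy symbol index. It uses Python's built-in `ast` module as a stand-in for Tree-sitter (which does the same kind of parsing across many languages) and is not taken from any of the tools reviewed here.

```python
import ast
from typing import Dict, List

def index_symbols(source: str) -> Dict[str, List[str]]:
    """Map each top-level function or class to the plain names it calls."""
    tree = ast.parse(source)
    index: Dict[str, List[str]] = {}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            calls = sorted({
                child.func.id
                for child in ast.walk(node)
                if isinstance(child, ast.Call) and isinstance(child.func, ast.Name)
            })
            index[node.name] = calls
    return index

sample = '''
def load(path):
    return open(path).read()

def main():
    text = load("notes.txt")
    print(len(text))
'''

symbols = index_symbols(sample)
print(symbols)
```

An index like this lets a tool answer "what does `main` depend on?" without sending every file to the model, which is why parsing and indexing infrastructure matters more as codebases grow.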

But here's the question that keeps surfacing: Where is coding actually heading?

If AI can handle more routine coding tasks autonomously, the primary interface might shift from "place where I write code" to "place where I direct and review code." That's a fundamentally different tool category.

Warp Code feels like it's positioning for this transition. Rather than trying to replace your IDE, it's building the command-and-control interface for AI-driven development. Issue commands, review outputs, iterate on requirements—all from the terminal where you're already orchestrating builds, deployments, and system interactions.

Bottom Line Assessment

Should you try Warp Code? If you're already using Warp and occasionally need AI assistance with coding questions, yes. The AI integration is competent, the model access is high-quality, and it enhances rather than disrupts existing terminal workflows.

Will it replace your primary coding setup? No. The modality gaps are too significant for complex development work. You'll still do your actual coding elsewhere.

Is it positioned correctly for the future? That's the interesting question. As coding becomes more agent-driven, the terminal might indeed become the primary control interface. Warp Code feels like an early experiment in that direction.

The real test isn't whether Warp Code works today—it does, for what it attempts. The test is whether their architectural assumptions about the future of coding interfaces prove correct.

Warp Code is available now with Claude Sonnet 4 and GPT-4o integration. $20/month with annual discount for basic usage; $200/month tier for expanded access.

Zen Coder: Decent Execution, Overengineered Experience

Robert Matsuoka — Thu, 26 Jun 2025 19:02:17 GMT

My friend Ophir passed on a link to Zen Coder a couple weeks ago, and I’ve seen their active marketing frequently since then, so I was curious what the fuss was all about. I spent the morning using it on a task that had been on my to-do list for a while: refactoring my AI Code Review project by breaking down massive 1000+ line files into manageable modules. My take on Zen Coder? It works fine, but it's solving problems I didn't know I had while creating new ones I definitely don't want.

The full refactoring work is available on GitHub if you want to see what was accomplished.

What Works

The core functionality delivered. When I asked it to systematically break down large TypeScript files into logical units, Zen Coder handled it competently. It read my project's INSTRUCTIONS.md files correctly, generated clean code that followed established patterns, and maintained proper imports throughout the refactoring.

The "Repo Grokking" feature genuinely understands codebases. Unlike tools that work file-by-file, it demonstrated comprehensive awareness of my project structure. Made sensible decisions about module separation. Kept architectural consistency across multiple files.

One feature genuinely impressed me: the session summarizer. At the end of our work, it provided this kind of clear overview:

"We have successfully fixed the test failures in the codebase after merging the refactor/prepare-codebase branch into main. Fixed ModelInfoUtils.ts... Fixed OutputHandler.test.ts... Fixed ReviewExecutor.ts... Ran the full test suite... Committed and pushed changes..."

That's genuinely useful documentation—the kind developers want but rarely take time to write themselves. It understood scope, tracked what was fixed, presented it clearly.

The Complexity Problem

Here's where Zen Coder loses me: it's built around "multiple specialized agents" that supposedly handle different aspects of development. Coding Agent, Unit Test Agent, custom workflow agents.

But during my entire refactoring session, I never understood when I'd want a different agent or why switching would help. The system never explained the benefits or suggested optimal contexts for agent switching.

I used the default interaction for everything. It worked fine. The promised "agent orchestration" felt like marketing, not meaningful functionality.

UX Issues That Add Up

Several design decisions created unnecessary friction:

Terminal confusion: When Zen Coder launched commands requiring my input, it just waited. No indication it needed my response. I spent several minutes wondering if it was processing before realizing I needed to hit enter. Basic UX oversight.

Even worse, it had the annoying habit of executing commands in a terminal, then immediately closing that terminal and reverting to the previous one. So I couldn't see what happened in the terminal session. Want to check if that command worked? Good luck—the evidence just disappeared.

Permission redundancy: Despite approving changes through Zen Coder's tools, it kept asking permission to run those same tools. The "step limit" feature pauses after certain steps, which felt redundant when it's already asking for individual command approval.

Missing system instructions: Despite its repository analysis capabilities, Zen Coder repeatedly asked for help understanding my repo in new sessions. It offers manual instruction input, but there's no automatic way to read existing instruction files (INSTRUCTIONS.md, .cursorrules) that most developers maintain. If it can read my entire codebase, why can't it discover my documented conventions?

Plugin-based sessions: The UX reminded me of Zed's approach—session history with exposed artifacts—but felt clunkier because it's running as a plugin rather than native integration.
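To make the "missing system instructions" complaint concrete, here is a minimal sketch of what automatic discovery could look like. It is not Zen Coder's code or API. The function name and the idea of passing in a flat list of repo-relative paths are assumptions for illustration; the two file names are the ones mentioned above.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# File names taken from the review; a real tool would likely support more.
INSTRUCTION_FILE_NAMES = ("INSTRUCTIONS.md", ".cursorrules")


def discover_instruction_files(
    repo_paths: Iterable[str],
    names: Iterable[str] = INSTRUCTION_FILE_NAMES,
) -> List[str]:
    """Return repo-relative paths whose file name is a known instruction file.

    Shallower paths sort first, so root-level conventions would be read
    before per-directory ones. (Hypothetical helper, not a real tool's API.)
    """
    wanted = {name.lower() for name in names}
    matches = [p for p in repo_paths if PurePosixPath(p).name.lower() in wanted]
    return sorted(matches, key=lambda p: (len(PurePosixPath(p).parts), p))


paths = [
    "src/index.ts",
    "INSTRUCTIONS.md",
    "src/review/INSTRUCTIONS.md",
    ".cursorrules",
    "README.md",
]
found = discover_instruction_files(paths)
```

Since the tool has already indexed the repository, a lookup of this kind would cost almost nothing at the start of each session.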

Market Reality Check

The research confirms my experience. InfoWorld's review characterized Zen Coder as "wet behind the ears," noting its innovations "aren't ipso facto better" than competitors generating correct code immediately.

Technical limitations include:

Repository analysis takes significant time with apparent full reprocessing on refresh
Agent repair capabilities limited to simple bugs versus complex whole-repo fixes
Error-corrected inference pipeline doesn't demonstrably outperform models that generate correct code initially

Team Features Nobody Asked For

Zen Coder's major differentiator is team collaboration—creating and sharing "Zen Agents" across organizations. The platform emphasizes workflow standardization and institutional knowledge transfer.

This sounds compelling in theory but reflects a fundamental misunderstanding of how developers work. We want tools that help us code better, faster, with fewer errors. We don't want to manage agent libraries or orchestrate workflow handoffs.

The pricing model—550 "Premium LLM Calls" per user per 24 hours at $19-39/month—adds complexity without clear value over straightforward subscriptions.

Different Tools for Different Jobs

Against established alternatives, Zen Coder's value proposition is unclear:

Regular Claude delivers the same underlying capabilities with less ceremony. Claude's coding assistance is excellent without agent orchestration overhead.

Cursor offers superior user experience, more intuitive interface, competitive pricing at $20/month. The IDE experience is simply better.

GitHub Copilot provides universal familiarity, seamless integration, enterprise trust through Microsoft backing. Does what most developers need without complexity.

Windsurf delivers more polished implementation, cleaner interface, competitive features at lower cost.

Augment Code operates in similar fashion to Zen Coder—remote agents, multi-step operations—but with a far more streamlined interface within VS Code. Where Zen Coder's agent orchestration felt like marketing theater, Augment's remote agents delivered practical value. I'll cover this in detail in a separate review, but the contrast is telling: similar underlying concepts, dramatically different execution quality.

Zen Coder excels at repository understanding and offers team collaboration features. But these advantages don't overcome the UX friction and implementation gaps.

Bottom Line

Zen Coder isn't broken—it's overcomplicated. The core technology works, agents produce quality code, repository understanding is genuinely useful. But it wraps these capabilities in unnecessary complexity that creates more problems than it solves.

The platform feels designed by people who think about AI development theoretically rather than practically. Real developers want tools that get out of their way, not platforms requiring them to think about agent orchestration and workflow management.

For teams satisfied with existing AI coding tools, there's no compelling reason to switch. The promised collaboration benefits don't materialize for most development workflows, and technical execution lags behind more polished alternatives.

If you're looking for AI coding assistance, stick with proven options. They deliver the same core benefits with less friction and more predictable experiences.

Zen Coder represents the kind of overengineered solution that happens when startups feel compelled to differentiate in crowded markets. Sometimes the simpler tool really is the better tool.

Based on hands-on testing with TypeScript refactoring and comprehensive market research. Your experience may vary.

I Still Hate PowerPoint. I Think I'm Going to Love Gamma

Robert Matsuoka — Fri, 13 Jun 2025 14:02:14 GMT

Eating Crow (Sort Of)

On Friday, I published a piece about why I hate PowerPoint and demonstrated how Claude AI can quickly turn outlines into functional slideshows. I stand by that approach during the development phase—when you're iterating on structure, having an interactive model that responds to prompts beats wrestling with slide templates every time.

But I'm eating a little crow today after actually trying Gamma.

Turns out, purpose-built AI designed around turning ideas into presentations might actually be how most people solve their presentation problems going forward. Who knew?

The Gamma Reality Check

I took the demo slideshow I'd created for one of my clients—the same one I'd used to demonstrate the Claude approach—and literally copy-pasted the text outline into Gamma without any modifications. That's it. No formatting, no style guidance, no design direction.

Gamma picked out the style, illustrations, and format entirely by itself. In literally no time, with zero effort on my part beyond the paste operation, it turned that raw outline into a genuinely stunning presentation.

Here's the actual result

Were I to give that presentation again, I'd likely use this version. The visual design, the flow, the polish—it's exactly what you'd want for a client-facing deck without the usual hours of formatting hell. Pretty impressive considering it had nothing to work with but plain text bullets.

The Refined Workflow

Does this mean always use Gamma? Not quite. Here's what I think the optimal workflow actually looks like:

Phase 1: Structural Development Use Claude (or similar) for rapid iteration on content and structure. The back-and-forth of refining bullets, reordering sections, and testing different angles works best when you can prompt your way through changes instantly.

Phase 2: Final Polish Once your structure is solid, tools like Gamma shine. Copy-paste your refined outline—no formatting required—and let it handle everything: visual design, illustrations, layout, typography. The fact that it made intelligent design decisions from nothing but raw text bullets is genuinely impressive.

The limitation remains: Gamma isn't as interactive for mid-stream edits. If you want to rearrange bullets or restructure sections, you're better off going back to the prompt-driven approach, then re-generating the polished version.

Another Gamma slide. Folks, I just pasted in the text content. That’s it.

What This Really Means

This experience reinforces something bigger: AI prompting over traditional tooling is clearly where UX is heading.

Instead of:

Opening PowerPoint
Choosing a template
Fighting with slide layouts
Manually formatting elements
Spending hours on visual consistency

We get:

Copy-paste your outline
Let AI make all design decisions
Get a polished result instantly
Focus purely on content quality

The fact that Gamma produced a client-ready presentation from nothing but raw text bullets—no style guidance, no formatting hints, no design direction—demonstrates how far this approach has already come.

Gamma represents the first wave of this shift, and while I don't know if it's the best solution, it's pretty impressive. More importantly, it validates the fundamental premise: when AI can understand intent and execute design decisions, the traditional software interaction model starts looking absurd.

Yes, Claude’s slideshow generator could build simple graphs. But nothing like this.

The Honest Assessment

My Friday point about Claude for slideshow development? Still valid for the messy, iterative work of getting your ideas straight. But Gamma has shown me that the final presentation layer is ripe for this kind of AI-first approach.

The real insight isn't that one tool beats another—it's that we're seeing the emergence of a genuinely better workflow. Prompt-driven content development feeding into AI-powered design execution.

That's not just an incremental improvement over PowerPoint. That's a completely different category of solution.

Bottom Line

I still hate PowerPoint. But I'm starting to love what comes after it.

The future of presentations isn't about better slide software—it's about describing what you want and having AI figure out how to make it compelling. Gamma gets us closer to that reality than anything I've used before.

Now I just need to figure out how to expense a Gamma subscription as "competitive research."

Zed Gets the AI Editor Right

Robert Matsuoka — Wed, 04 Jun 2025 19:10:23 GMT

Sometimes timing is a funny thing in tech. I just finished reviewing Factory AI, which perfectly exemplifies what happens when you build tool-based experiences for AI workflows: clunky interfaces, fragmented interactions, the kind of UX that makes you wonder if anyone actually used the thing before shipping it.

Then along comes Zed.

I'll be honest: I wasn't familiar with this editor before diving in. But after spending serious time with it on a real testing framework conversion, I'm convinced they've cracked something fundamental about AI-first development environments. They didn't start with an editor and add AI—they started with the AI session and built an editor around it.

Unlike Factory's web-based editor, Zed's very responsive and fast desktop editor allows complete window flexibility.

What Zed Actually Is

Zed positions itself as a task-focused, minimalist editor built around three configurable pane areas. You get an editor window (which you can dismiss entirely if you're doing pure AI work), a context-aware navigation pane that intelligently toggles between file trees, code outlines, and git status, and here's the key: a dedicated AI session view where the real work happens.

The interface feels deliberately uncluttered. After years of VS Code's extension sprawl and notification chaos, Zed's clean lines are refreshing. But don't mistake minimalism for lack of power.

The AI Experience: Claude-Quality Results

Zed offers multiple model choices, but I worked primarily with Claude Sonnet. The responses matched the quality I've seen from Claude Code in VS Code: same reasoning patterns, same ability to understand complex instructions.

I put this to the test with a real workflow. I fed Zed my Claude.md reference file (the one I've customized with project instructions, file references, and development workflows) and asked it to handle a testing framework conversion. Not only did it parse and follow the instructions correctly, but it seamlessly worked with the GitHub API to create issues, managed Git branching, and maintained context across a complex, multi-step process.

It did make one mistake initially: it forgot to work in a feature branch. But Zed recovered smoothly by re-scoping all changes to the proper branch structure. A small miss, but handled well.

Nice recovery, Zed.

What impressed me most: when I asked Zed to commit changes, it automatically re-reviewed the project instructions I'd provided earlier. This is context management that Claude often misses—maintaining awareness of core project requirements across long sessions. The fact that Zed preserves and actively references foundational instructions shows sophisticated session state management.

Even more impressive: Zed took my workflow instructions seriously and kept working on CI/CD test requirements until they actually passed. Unlike Claude, which tends to give up when tests fail repeatedly, Zed persisted until it got them working. Good thing too—I'd been skipping those failing tests for weeks. Zed basically forced me to clean up technical debt I'd been avoiding.

In terms of raw AI capability, this is Claude Code territory. But the interface makes all the difference.

The Streaming Session Revolution

Here's where Zed fundamentally changes the game. Unlike static diffs, Zed streams each micro-change as a real-time breadcrumb trail. You're not just watching aggregate results—you're seeing the AI's thought process unfold sequentially.

Each edit appears granularly: "Updating function signature," followed by the actual change, then "Adding error handling," with its corresponding code. Unlike traditional editor-based changes where you can't trace which request triggered which modification, Zed's session view maintains perfect lineage.

But here's the killer feature: full terminal session capture. When Zed runs tests, builds, or any command-line operation, you see the entire terminal history embedded in the session. Each interactive run becomes a timestamped moment you can scroll back to and review.

This is profound. You're not just getting code changes—you're getting the complete development context. Test failures, build outputs, deployment logs, all integrated into a coherent narrative of what the AI actually did.
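One way to picture the lineage idea is as an append-only event log in which every edit and every terminal run carries the ID of the prompt that triggered it. The sketch below is only a mental model under that assumption; the class and field names are hypothetical and say nothing about how Zed actually stores sessions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class SessionEvent:
    prompt_id: int  # which user request triggered this event
    kind: str       # "edit" or "command"
    summary: str    # e.g. "Updating function signature" or the command line
    detail: str     # the diff snippet or the captured terminal output


@dataclass
class SessionLog:
    """Append-only record of everything the assistant did (illustrative)."""
    events: List[SessionEvent] = field(default_factory=list)

    def record_edit(self, prompt_id: int, summary: str, diff: str) -> None:
        self.events.append(SessionEvent(prompt_id, "edit", summary, diff))

    def record_command(self, prompt_id: int, command: str, output: str) -> None:
        self.events.append(SessionEvent(prompt_id, "command", command, output))

    def lineage(self, prompt_id: int) -> List[SessionEvent]:
        """Everything a given prompt caused, in the order it happened."""
        return [e for e in self.events if e.prompt_id == prompt_id]


log = SessionLog()
log.record_edit(1, "Updating function signature", "- run()\n+ run(timeout)")
log.record_command(1, "npm test", "1 failed, 41 passed")
log.record_edit(2, "Adding error handling", "+ try { ... } catch (e) { ... }")
first_request = log.lineage(1)
```

Because terminal output sits in the same stream as the edits, scrolling back through one request shows both what changed and what the tests said about it.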

If I were Anthropic, I'd be looking at this interface as a superior paradigm to plain terminal interactions. The session view provides visibility that transforms AI coding from blind trust to informed collaboration.

One fix needed: scroll position jumps when reviewing past edits.

Streaming diff snippets.

Architecture Enabling AI Innovation

What makes this possible is that Zed didn't start with an existing editor and bolt on AI features. They built everything from scratch in Rust with AI workflows as a first-class consideration.

The team completely rewrote their approach from Atom's JavaScript/Electron foundation. Nathan Sobo's team implemented what they call a "custom streaming diff protocol that works with Zed's CRDT-based buffers to deliver edits as soon as they're streamed from the model." You see the model's output token by token, which creates the granular change tracking I described above.

Their GPUI framework treats the entire interface like a video game, rendering everything on the GPU. This isn't just for performance—it enables the seamless integration of terminal sessions, code changes, and AI interactions into a single, coherent stream.

By owning the full stack from Tree-sitter parsing to GPUI rendering, they could architect tight integration between their "fast native terminal" and "language-aware task runner and AI capabilities." This is why terminal sessions appear seamlessly in the AI session view rather than feeling like separate tools awkwardly connected.

The contrast with Atom is striking. Where Atom used "an array of lines, a JavaScript array of strings" for the buffer, Zed uses "a multi-thread-friendly snapshot-able copy-on-write B-tree that indexes everything you can imagine." This architectural foundation enables the real-time collaboration and streaming AI features that define Zed's experience.
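The phrase "snapshot-able copy-on-write" is easier to grasp with a toy. The sketch below is a heavily simplified stand-in, not Zed's data structure: it uses a flat tuple of small immutable chunks instead of a B-tree, and every name in it is hypothetical. It only shows the property that matters here: an edit produces a new snapshot that shares the untouched chunks with the old one, so older snapshots stay valid and cost nothing to keep.

```python
from dataclasses import dataclass
from typing import Tuple

CHUNK_SIZE = 4  # deliberately tiny so the chunk sharing is easy to see


def _split(text: str) -> Tuple[str, ...]:
    """Cut text into immutable chunks of at most CHUNK_SIZE characters."""
    return tuple(text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE))


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the buffer; keeping a reference is the snapshot."""
    chunks: Tuple[str, ...] = ()

    @classmethod
    def from_text(cls, text: str) -> "Snapshot":
        return cls(_split(text))

    def text(self) -> str:
        return "".join(self.chunks)

    def insert(self, offset: int, new_text: str) -> "Snapshot":
        """Return a new snapshot; only the chunk containing offset is rebuilt."""
        if not 0 <= offset <= len(self.text()):
            raise IndexError("offset out of range")
        pos = 0
        for i, chunk in enumerate(self.chunks):
            if offset <= pos + len(chunk):
                local = offset - pos
                edited = chunk[:local] + new_text + chunk[local:]
                return Snapshot(self.chunks[:i] + _split(edited) + self.chunks[i + 1:])
            pos += len(chunk)
        return Snapshot(_split(new_text))  # inserting into an empty buffer


before = Snapshot.from_text("hello world")
after = before.insert(5, ",")
```

Here `before` still reads "hello world" after the edit, and the first and last chunks of `after` are the very same objects as in `before`. A tree-shaped version of the same idea is what lets a reader on another thread hold a consistent view while edits stream in.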

Performance and Polish

Zed's Rust foundation delivers on its performance promises. The editor is genuinely fast—no spinning wheels, no stuttered typing, no lag switching between files. Coming from Electron-based tools, the responsiveness feels almost shocking.

I found myself closing the traditional editor pane entirely, working in a two-screen setup with git/tree view on the left and the AI session on the right. And here's the concerning part: I might prefer this UX to VS Code with Claude Code integration.

There are also lots of little niceties. When you “copy relative path” in the file browser, Zed automatically adds that file to the context. It also accepts changes and new instructions mid-stream, integrates them into its workflow, and then resumes the task it was working on before.

One UX improvement I'd love: show the current git branch consistently, regardless of whether you're in code view or git view. Small detail, but it matters for workflow awareness.

That said, I'm convinced enough that I'm already planning to use Zed for several upcoming projects.

Pricing Reality Check

Zed's pricing is refreshingly straightforward: $20/month for 500 prompts, then usage-based billing beyond that (at the time of writing). For context, I converted roughly 50% of the testing framework for a substantial codebase (47,000+ lines of TypeScript) using 150 prompts.

If you're doing heavy AI-first development—100% prompt-driven like I was testing—you'd probably exhaust the 500 prompts within a week. But for mixed human/AI workflows, this pricing feels sustainable and affordable.
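As a rough back-of-the-envelope check on those numbers, here is a small sketch that extrapolates the prompt budget. The figures come from the paragraphs above; the linear extrapolation (the second half of the conversion costing as much as the first) is an assumption, and overage pricing is left out because it isn't specified here.

```python
import math

MONTHLY_PROMPTS = 500  # included in the $20/month plan
PROMPTS_USED = 150     # prompts spent on the conversion so far
FRACTION_DONE = 0.50   # roughly half of the testing framework converted


def estimate_total_prompts(used: int, fraction_done: float) -> int:
    """Linear extrapolation: assumes remaining work costs the same per unit."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return math.ceil(used / fraction_done)


total_prompts = estimate_total_prompts(PROMPTS_USED, FRACTION_DONE)
share_of_allowance = total_prompts / MONTHLY_PROMPTS

print(f"Estimated prompts for the full conversion: {total_prompts}")
print(f"Share of the monthly allowance: {share_of_allowance:.0%}")
```

Under that assumption the whole conversion lands at about 300 prompts, or 60% of one month's allowance, which is why a single large migration fits the plan while sustained, fully prompt-driven work would not.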

I suspect they'll adjust pricing upward or offer higher-tier prompt packages as adoption grows. The current model feels designed to attract users rather than maximize revenue.

Bottom Line Assessment

Zed has earned a spot among my favorite new development tools. It might be superior to Augment, and it's definitely a better AI coding experience than traditional editor-based approaches.

The streaming session view alone justifies trying Zed. Watching AI development unfold with full context—code changes, terminal outputs, git operations—provides insight into AI reasoning that other tools simply can't match.

Is it perfect? No. The scroll behavior needs fixing, pricing will likely increase, and you're locked into their ecosystem rather than enhancing your existing workflow.

But they've solved a fundamental UX problem that most AI coding tools ignore: how do you make AI development visible, trackable, and collaborative rather than mysterious and opaque?

What This Means for AI Development Tools

Zed proves that AI coding tools need rethinking from the ground up, not just bolting AI features onto existing editors. The session-based paradigm offers advantages that traditional editor-based or terminal-based approaches can't match.

More importantly, it demonstrates that UX innovation matters as much as AI capability. Claude's reasoning power paired with Zed's interface creates a development experience that's genuinely superior to either component alone.

For developers frustrated with existing AI coding workflows, Zed deserves serious evaluation. For established players in this space, it should be a wake-up call about the importance of interface design in AI-first development environments.

Zed doesn't just add AI—it rethinks what working with it should feel like.

Factory AI CodeDroid: Promising Concept, Premature Execution

Robert Matsuoka — Wed, 04 Jun 2025 19:01:08 GMT

I went into Factory AI rooting for it—because the premise, frankly, is what the next era of agentic coding should look like.

Multiple specialized agents handling different aspects of development. A local MCP bridge connecting to your actual environment (something I use extensively in my own AI workflows). The promise of truly autonomous coding—not just assistance, but "set it and forget it" development where you hand off a task and come back to finished work.

After spending time with CodeDroid on a real bug fix for my AI code review tool, here's what I found: great architectural vision, flawed execution, and a product that isn't ready for serious development work.

What Factory AI Gets Right

The specialized agents concept makes sense. Instead of one monolithic AI trying to handle everything from coding to testing to documentation, Factory breaks this into distinct Droids: CodeDroid for implementation, Review Droid for pull requests, QA Droid for testing, and others for specific workflow stages.

The local MCP bridge approach clicked with how I already work. Having an agent that can actually interact with your local environment—run builds, execute commands, understand your project structure—feels like the natural evolution beyond browser-based coding assistants.

Most importantly, CodeDroid actually solved the specific bugs I reported. It understood my workflow for creating GitHub issues, worked properly in a feature branch, and delivered fixes for the problems I described. That's table stakes, but worth noting when so many AI tools struggle with basic task completion.

The CodeDroid Session UI

The Fundamental Architecture Problem

Here's where things get interesting, and not in a good way.

Factory markets itself as autonomous, but requiring users to manually switch between agent "modes" isn't autonomous behavior. A proper multi-agent system should have an orchestrator agent that takes your task, understands the full workflow, and delegates appropriately. It should hand off to documentation agents when specs need updating, QA agents when testing is required, research agents when context gathering is needed—and just return results.

Right now, it's more like an app switcher for different AI roles than a true autonomous system. You pick your agent like you're picking a tool from a menu—not assigning a task and letting the system run with it.

When I asked CodeDroid to handle my bug fix, it never automatically transitioned to testing phases or involved other agents. I got code changes but no validation that the broader system still worked. For something positioning itself as enterprise-grade autonomous development, this feels like a fundamental architectural oversight.
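The orchestrator pattern described above can be sketched in a few lines. This is an illustration of the idea only, not Factory's architecture: the agents are stub functions, and the workflow table and all names are hypothetical. The point is that the caller names a task once and the handoffs happen without manual mode switching.

```python
from typing import Callable, Dict, List

Agent = Callable[[str], str]


# Stub agents standing in for real model-backed workers (hypothetical).
def code_agent(task: str) -> str:
    return f"code: implemented '{task}'"


def qa_agent(task: str) -> str:
    return f"qa: ran tests for '{task}'"


def docs_agent(task: str) -> str:
    return f"docs: updated notes for '{task}'"


AGENTS: Dict[str, Agent] = {"code": code_agent, "qa": qa_agent, "docs": docs_agent}

# Which stages each kind of task passes through, in order (assumed workflows).
WORKFLOWS: Dict[str, List[str]] = {
    "bug_fix": ["code", "qa"],
    "feature": ["code", "qa", "docs"],
}


def orchestrate(task: str, task_type: str) -> List[str]:
    """Run every stage for the task type and return one report per stage."""
    stages = WORKFLOWS.get(task_type)
    if stages is None:
        raise ValueError(f"no workflow defined for task type {task_type!r}")
    return [AGENTS[stage](task) for stage in stages]


reports = orchestrate("fix null check in ReviewExecutor", "bug_fix")
```

A real orchestrator would choose the stages dynamically and feed each agent the previous agent's output, but even this fixed table guarantees that a bug fix never skips the QA step.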

Speed That Kills Productivity

The performance issues are brutal. Compared to Cursor's sub-30-second turnarounds or Claude Code's near-instant drafts, Factory's response times feel stuck in staging. We're not talking about marginally slower—this was painfully, productivity-killingly slow.

The bigger problem is feedback. Factory's UX assumes you'll start a task and walk away, but even if that's the intended workflow, you need meaningful progress indicators. Instead, you get minimal feedback during long operations, making it impossible to tell if the system is working or stuck.

This might be token conservation at play—enterprise pricing models often encourage batched operations over interactive responsiveness. But when tools like Claude Code deliver near-real-time responses with superior quality, the value proposition breaks down.

I asked this question frequently.

UX Contradictions

The interface reveals more complexity than it should. If autonomy is the goal, the system shouldn't be asking me to watch a context panel like a flight dashboard. But that's exactly what Factory does—three separate windows (session, context, history) compete for attention when the promise is hands-off operation.

Basic usability issues compound the problem. The history window proved useless—I could never parse what tasks were actually running. Multiple times, dialog boxes for accepting changes were hidden behind other columns, leaving me confused about why the system appeared stuck.

When CodeDroid needed permissions for non-trivial operations, it required manual approval or would halt progress. Again, this contradicts the autonomous positioning. Either work in a feature branch where it can do everything safely, or build proper permission models that don't require babysitting.

Factory AI’s Column-Based UX (Session and context columns shown. The useless history column not shown. Note the “Accept” button. It’s sometimes behind the context column, making you wonder why things have paused…)

Execution Gaps That Undermine Trust

CodeDroid fixed the specific issues I reported, which is good. But it didn't catch other TypeScript compilation errors and linting issues that already existed in the codebase. To be clear—it didn't cause these problems, but it didn't identify them either. More problematically, it claimed successful builds and mentioned fixing these problems, but when I tried to actually compile and run the code, things failed.

I'm not sure how it tested its changes if the build was broken. Maybe that's QA Droid's job, but without automatic handoff between agents, these gaps become visible and concerning.

The system also failed silently on GitHub API integration. I have a specific workflow where my AI tools create issues using environment variables, but CodeDroid couldn't execute it, and it acknowledged the limitation only after attempting the work. For enterprise-grade tooling, upfront capability assessment should be standard.
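An upfront capability check does not need to be elaborate. The sketch below shows one possible shape for it, assuming the workflow depends on environment variables: verify they are set before any work starts and fail loudly if not. The variable names and function names are hypothetical, not part of Factory's product or of my actual setup.

```python
from typing import Iterable, List, Mapping


def missing_env_vars(required: Iterable[str], env: Mapping[str, str]) -> List[str]:
    """Return the required variables that are unset or empty, in input order."""
    return [name for name in required if not env.get(name)]


def preflight(required: Iterable[str], env: Mapping[str, str]) -> None:
    """Refuse to start a task if a needed capability is unavailable."""
    missing = missing_env_vars(required, env)
    if missing:
        raise RuntimeError(
            "Cannot start task; missing environment variables: " + ", ".join(missing)
        )


# Hypothetical requirements for an "open a GitHub issue" workflow.
REQUIRED = ["GITHUB_TOKEN", "GITHUB_REPOSITORY"]
example_env = {"GITHUB_TOKEN": "placeholder-token", "GITHUB_REPOSITORY": ""}
problems = missing_env_vars(REQUIRED, example_env)
```

In practice the `env` argument would be `os.environ`, and the check would run before the agent accepts the task rather than after it has already attempted the work.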

3 Column Interface with a code editor.

Quality Doesn't Match the Claims

Despite reportedly using Sonnet models, the output quality felt inferior to what I get from Claude Code or Augment Code. This might be prompt engineering, context management, or architectural overhead, but the results matter more than the underlying model choice.

When you're positioning as premium enterprise tooling with autonomous capabilities, the bar is high. Traditional "assisted" tools like Cursor feel more autonomous in practice because they maintain momentum, provide clear feedback, and deliver reliable results.

The Trust Problem

Autonomous coding only works if you trust the outcomes. And trust, in this space, isn't earned by marketing—it's earned by builds that run and issues that don't come back two days later.

When CodeDroid claims it fixed linting errors but leaves broken builds, when it reports successful testing without actually running tests, when it silently fails on API integrations—that erodes confidence rapidly. The fundamental promise is that you can hand off work and trust the results. But if you still need to verify everything manually, validate builds, and debug what actually happened, the tool hasn't delivered on its core value proposition.

How I'd Fix This

Look at what actually works in autonomous AI development. Codex Web gives you a clean dashboard view of progress with real-time updates. Jules keeps the interface minimal because the AI does the heavy lifting. Devin doesn't even have a traditional interface—you communicate via chat and it shows you results.

Factory has this backwards. They're exposing complexity that should be hidden while failing to provide the essential feedback you actually need.

If I were Factory, here's what I'd build: A dashboard that shows clear progress indicators and task delegation in real-time. Full autonomous operation within feature branches—no permission requests for basic operations. Automatic orchestration between agents without manual mode switching. And actual integration with workflow tools—not just API calls that silently fail, but native JIRA, GitHub Issues, and Linear integration that works reliably.

The current three-column interface feels like they're trying to show off their technical sophistication rather than solving user problems. That's not enterprise software—that's demo software.

Where This Goes From Here

Factory AI has raised significant funding ($20M+) and secured enterprise customers, suggesting the market believes in the vision. The multi-agent architecture, local environment integration, and autonomous positioning represent thoughtful approaches to where agentic coding should evolve.

But execution matters. Speed issues make the tool unusable for iterative development. UX contradictions undermine the autonomous claims. Quality gaps break trust when you most need reliability.

I see potential here, and hopefully they'll get there. The concept is sound, and some foundational pieces work. Until Factory closes its execution gap, it's not ready for prime time. The ideas are there. The trust isn't.

For teams evaluating agentic coding tools, stick with Cursor, Claude Code, or Augment Code until Factory addresses these fundamental execution problems. The promise of true autonomous development is compelling, but promise isn't product.

Tested on Factory AI CodeDroid beta with a TypeScript/Node.js project. Your experience may vary, especially as they iterate on performance and UX issues.

Why Voice-to-Text Finally Makes Sense for Developers

Robert Matsuoka — Tue, 03 Jun 2025 14:02:16 GMT

I know, I know. Another "game changer" article. We've been drowning in them lately, and frankly, most aren't. But stick with me on this one—because I think we might actually be witnessing something fundamental shift in how we interact with our tools.

My friend Ophir introduced me to SuperWhisper a few weeks back. It's not breaking news—many of you probably already use it or something similar. But after integrating it into my daily workflow, I'm starting to see something bigger happening here. Something that connects to what we've been exploring with agentic coding tools and the broader shift toward more natural human-computer interaction.

What SuperWhisper Gets Right

SuperWhisper is an AI-powered voice-to-text tool that runs locally on macOS. At its core, it's elegantly simple: hit a hotkey, speak, release, and your words appear wherever your cursor was. But the execution is what matters.

Accuracy that consistently delivers. I'm a fast typist and always have been. I've never felt constrained by typing speed for either coding or writing. But SuperWhisper's transcription accuracy genuinely impresses. It handles technical terminology with precision, understands contextual nuances, and filters background noise effectively. More importantly, it maintains reliability when I switch between casual explanation and technical specifics mid-sentence.

Zero-friction activation. Control key brings up the interface, Control key closes it and processes. That's the complete interaction. No context switching, no separate applications, no workflow disruption. It integrates seamlessly into whatever you're already doing.

Local processing with robust privacy. Everything runs on-device using Whisper models. No cloud dependencies, no data transmission, no latency from API calls. Your thoughts remain precisely that—yours.

Where This Gets Interesting

Here's what I've noticed after a few weeks of use: I'm not just replacing typing—I'm changing how I think through problems.

Considering the practical implications, this technology represents more than efficiency gains—it fundamentally changes how we approach problem-solving. When I'm working through complex architectural decisions, I find myself speaking through the challenges while examining the code. The verbal exploration forces greater clarity than scattered written notes ever could. For longer documentation, the difference becomes even more pronounced—instead of starting and stopping to craft sentences mentally before typing, I maintain natural conversational flow that ultimately produces more accessible technical content.

The Bigger Picture: Prompting Over Tooling

This shift in how I think also signals something broader happening in the dev world. As AI assistants become better at understanding natural language requirements and generating structured output from conversational input, the friction between human intent and machine capability continues to decrease.

Think about it: we've already established that AI can handle complex coding tasks when given clear specifications. Tools like Cursor, Codeium, and others are getting remarkably good at understanding intent and generating functional code. The bottleneck isn't AI capability—it's our ability to communicate what we want effectively.

Typing, for all its efficiency, is still fundamentally constrained. We think faster than we type, and we speak more naturally than we write. When you can verbally describe a problem, provide context, and iterate on solutions through conversation rather than careful text crafting, the entire dynamic changes.

In Practice: What This Looks Like

I've been experimenting with voice-driven workflows across different scenarios:

Code documentation. Instead of dreading comment blocks and README files, I speak through the explanation while looking at the code. The natural flow of verbal explanation often catches edge cases and assumptions I'd skip in written docs.

Issue tracking. Creating GitHub issues or Jira tickets by speaking through the problem, reproduction steps, and acceptance criteria. The conversational format makes these artifacts more useful for both AI assistants and human collaborators.

Code review discussions. When leaving feedback on pull requests, speaking the comment and then editing produces more constructive, clearer feedback than starting from a blank text box.

Architectural thinking. This one surprised me. When working through system design decisions, speaking through trade-offs and implications helps identify weaknesses in my reasoning that pure thought or written analysis might miss.

AI prompt crafting. Here's a benefit I didn't anticipate: voice dictation fundamentally improves how I interact with AI tools. When typing prompts, there's always the temptation to be concise—your fingers get tired, you want to get to the point quickly. But with AI, detailed prompts consistently produce better results. SuperWhisper excels at cleaning up your speech stream, removing the "ums" and "ahs" while preserving your actual intent. What you're left with is precisely what you meant to say, as detailed as necessary. Since I'm not constrained by typing fatigue, I can provide all the context, examples, and nuance that lead to superior AI output. More detailed instructions mean better prompts, which mean better AI collaboration.
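
To make the "cleaning up your speech stream" step concrete, here is a minimal sketch that strips common filler words from a raw transcript. It is not SuperWhisper's implementation, which this article doesn't describe; the `FILLERS` list and the `clean_transcript` function are hypothetical names chosen for illustration, and a fixed word list is a deliberately naive stand-in for whatever a real tool does.

```python
import re

# Hypothetical filler list for illustration; a real tool would need something
# far more context-aware than a fixed word list.
FILLERS = ["um", "uh", "ah", "er", "you know"]

# Match a filler as a whole word, plus an optional trailing comma.
_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b,?",
    flags=re.IGNORECASE,
)


def clean_transcript(raw: str) -> str:
    """Remove filler words and tidy the leftover whitespace."""
    without_fillers = _FILLER_RE.sub("", raw)
    # Collapse the runs of spaces left behind by the removed words.
    collapsed = re.sub(r"\s+", " ", without_fillers).strip()
    # Remove any space that now sits directly before punctuation.
    return re.sub(r"\s+([,.!?])", r"\1", collapsed)


if __name__ == "__main__":
    print(clean_transcript("Um, so refactor the uh login handler and er add tests."))
```

Matching on word boundaries keeps the filter from touching words that merely contain a filler, such as "handler" or "number".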

Limitations Worth Noting

SuperWhisper isn't perfect, and voice-to-text isn't universally better than typing.

Privacy considerations you might not expect. Here's something I discovered the hard way: while dictating this very article, I had a video meeting window open in the background that I'd forgotten to mute. (Nobody else was on the call; I was just waiting for someone to show up. But still...) When I switched back to that window, SuperWhisper had helpfully transcribed the entire conversation for me. This isn't a flaw in the product; it's a reminder that frictionless tools demand heightened awareness. If it's audible, it's capturable.

The accuracy, while impressive, isn't 100%. Technical terms, proper nouns, and code-specific language still require cleanup. You'll find yourself developing verbal habits to work around transcription quirks—spelling out variable names, using "dot" instead of periods for method calls, that sort of thing.
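
The verbal workarounds above could also be handled in a small post-processing step. The sketch below is a toy example, not a SuperWhisper feature: it rewrites a few spoken tokens into the symbols they stand for. The `SPOKEN_SYMBOLS` mapping and the `spoken_to_code` function are hypothetical.

```python
import re

# Hypothetical mapping from spoken words to code symbols.
SPOKEN_SYMBOLS = {
    "dot": ".",
    "underscore": "_",
    "open paren": "(",
    "close paren": ")",
}


def spoken_to_code(phrase: str) -> str:
    """Rewrite spoken symbol names into symbols, joining their neighbours."""
    result = phrase
    for spoken, symbol in SPOKEN_SYMBOLS.items():
        # Swallow the surrounding spaces so "user dot name" becomes "user.name".
        pattern = r"\s*\b" + re.escape(spoken) + r"\b\s*"
        result = re.sub(pattern, symbol, result, flags=re.IGNORECASE)
    return result


if __name__ == "__main__":
    print(spoken_to_code("user dot get underscore name open paren close paren"))
```

A rewrite this blunt would also mangle ordinary prose that happens to contain the word "dot", so it only makes sense when applied to a span you know is dictated code.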

Environmental constraints matter more than with typing. Background noise, other people talking, video calls—all potential disruptions. And there's still something to be said for the muscle memory and direct brain-to-keyboard connection that experienced developers have cultivated.

Should You Try It?

Simple answer: Yes.

SuperWhisper offers a free tier with 15 minutes of Pro features, then basic functionality forever. The Pro subscription starts at $8.49/month or $84.99/year for unlimited use and access to larger AI models. Even if you only use it for documentation and longer explanatory text, it'll pay for itself in reduced friction.

But the real value isn't just productivity—it's the shift in how you approach problem-solving and communication. When the barrier between thought and text disappears, you start thinking differently about how to structure explanations, document decisions, and collaborate with both human and AI partners.

The End of Typing?

Are we looking at the eventual obsolescence of typing for developers? Probably not completely. Code structure, symbol manipulation, and certain kinds of precise editing will likely remain keyboard-centric for the foreseeable future.

But for the growing portion of development work that involves explanation, documentation, specification, and high-level problem-solving? Voice is starting to make a compelling case, especially as AI assistants become better at understanding natural language requirements and generating structured output from conversational input.

Voice isn't replacing typing—it's expanding what we mean by "input." And as AI systems grow more capable of parsing our intent, that expansion becomes a shift in workflow, not just modality. The real game changer isn't dictation. It's the frictionless path from thought to structured output.

SuperWhisper offers a generous free tier that includes core voice-to-text functionality across all applications, meeting recording and transcription, unlimited access to smaller AI models, custom prompt control, and email support—providing substantial value for everyday use without any monthly cost. For users requiring advanced capabilities, the Pro tier ($8.49/month, $84.99/year, or $249.99 lifetime) unlocks the full feature set including your own AI API key integration, unlimited access to both cloud and local AI models, automatic translation from any language to English, audio and video file transcription, and priority support. The annual plan includes two free months, while the lifetime option offers the best long-term value. Students can access additional discounts, and all paid plans include a 30-day no-questions-asked refund policy, making it risk-free to explore SuperWhisper's full capabilities.

Google Jules First Look: Not Ready for Prime Time

Robert Matsuoka — Thu, 22 May 2025 22:07:00 GMT

When Google announced Jules at I/O 2025, my reaction was split between genuine interest and a familiar sense of déjà vu. On one hand, a PR-based autonomous coding agent from Google sounds promising. On the other hand—another one?

Google already has Firebase Studio, which feels like v0 but somehow worse. Now they're entering the autonomous agent space with Jules, positioning it as a competitor to tools like OpenAI's Codex. Given Google's resources and engineering talent, I expected this to be a more polished entry into the market.

I was wrong.

What Jules Promises

Jules is designed as a PR-based autonomous agent that can read your codebase, understand issues, and implement fixes directly through GitHub pull requests. Powered by Gemini 2.5 Pro, it operates asynchronously in secure Google Cloud VMs, promising to handle everything from bug fixes to feature development. The concept is solid—point it at a problem, let it analyze your code, and watch it create a targeted solution.

The integration approach makes sense too. Rather than trying to replace your development environment, Jules works within the GitHub ecosystem you're already using. It's meant to be the autonomous contributor that never sleeps, handling routine fixes and feature implementations while you focus on architecture and strategy.

The Reality: Day One Disappointment

My first attempt was straightforward: fix a chatbox on a homepage that wasn't making requests to the OpenAI assistants API. Simple debugging task, clear scope, exactly the kind of work an autonomous agent should handle well.

Jules immediately hit heavy traffic issues, and I wasn't alone. Developers widely reported "System is experiencing heavy traffic" messages lasting for hours. One user described giving Jules "a big task" in the morning, then waiting "several hours" with only traffic warnings.

The system couldn't handle basic load, throwing "unexpected error" messages before I could even get started. When it finally responded, the conversation was more frustrating than helpful.

This was basically every other response… and yes, they ate up the 5 whole credits Google gave me for beta testing this…

I asked if it could read GitHub issues directly. The response was telling: "I am an AI assistant and I do not have the capability to access external websites, including GitHub. Therefore, I cannot directly read GitHub issues."

But... but... *it works by pulling code from GitHub!*

This fundamental limitation became a common complaint among early users. Multiple developers described giving Jules "pretty explicit instructions" about their codebase, only to have it "do what it wanted and failed," wasting one of their precious 5 daily credits on something they specifically told it not to do.

Missing the Mark on Basic Functionality

The limitations became more apparent as I continued testing. Jules asked if I wanted to enable notifications for when "a plan is ready or code is ready for review." This suggests it does eventually produce work, but the path to get there involves significant manual intervention.

The developer community revealed the broader scope of these issues. Users reported mixed results on code quality, with one comparison showing Jules writing 2,512 lines versus Codex's 77 lines for the same task. While more code isn't necessarily worse, users debated whether Jules was "overcomplicated with code duplication and unnecessary boilerplate."

Others hit timeout issues. "Jules was unable to complete the task in time" became a common complaint, with the tool failing to finish work within its allocated timeframe. The 5-tasks-per-day limit meant these failures burned through users' daily quota quickly.

Compare this to Codex Web, which reads your full codebase, understands context, and operates with minimal friction. Jules feels like it's solving the wrong problem—asking developers to become intermediaries in their own automation workflow.

What Google Got Wrong

Reflecting on this experience, the most concerning aspect isn't the expected growing pains of a beta release. Google's own FAQ acknowledges that task failures are common, citing "broken setup scripts or vague prompts" as typical causes. These operational issues are fixable with time and resources.

Rather, it's the fundamental design decisions that suggest Jules wasn't built by people who actually use autonomous coding tools regularly. If you're building a GitHub-integrated coding agent, direct repository access should be table stakes. Requiring developers to manually copy issue content defeats the entire purpose of automation—it's like building a self-driving car that needs someone to read the road signs out loud.

Google has the engineering resources to solve these problems. They have access to vast amounts of code and sophisticated language models. Yet Jules feels rushed to market, more concerned with matching competitor announcements than delivering genuine value.

The Broader Picture

This release highlights a concerning pattern in how big tech companies approach AI tooling. There's pressure to ship something, anything, to show they're not falling behind. But shipping broken tools that frustrate developers doesn't advance the field—it sets it back.

Firebase Studio suffered from similar issues. Instead of building something genuinely better than v0, Google shipped something that felt inferior. Now they're repeating the pattern with autonomous coding agents.

The autonomous coding space is heating up rapidly, with multiple players announcing similar tools within days of each other. Microsoft revealed its own background coding agent in GitHub Copilot at Build, and the community is already comparing capabilities. But this rush to market seems to prioritize announcements over user experience.

The developer feedback reveals this tension clearly. While some users found success with simple tasks, describing 5-minute fixes that "worked perfectly," others struggled with server slowness, task timeouts, and the tool ignoring explicit instructions. The 5-task daily limit means every failure hurts, burning through your quota with nothing to show for it.

Should You Try Jules?

Not yet. Save yourself the frustration.

Jules might evolve into something useful as Google addresses these fundamental limitations. The company has the resources and talent to fix what's broken. Some developers acknowledge it's "experimental/alpha" and expect improvements, noting that Google released it free specifically to gather feedback.

But the infrastructure problems combined with basic design flaws suggest Google underestimated both demand and the core requirements for autonomous coding agents. I'll revisit Jules when it can handle basic tasks without requiring manual workarounds and when the servers can actually stay online under load.

Until then, there are better autonomous coding tools available that actually work as advertised. The autonomous coding space is moving fast, with real innovation happening from smaller, more focused teams. Google's entry feels like they're playing catch-up rather than leading, and it shows in the execution.

The Agentic Coding Landscape: Part 2a - Tool Comparison

Robert Matsuoka — Tue, 20 May 2025 18:13:22 GMT

(Note: this is part of a 5-part series: Part 1, Part 2, Part 2a, Part 3, and Part 4.)

The agentic coding tools we've discussed all share the common goal of automating programming tasks, but they differ widely in their technical approaches and capabilities. In this section, we compare several major tools along key dimensions: the level of autonomy they offer, language and framework support, codebase navigation intelligence, CI/CD and DevOps integration, debugging abilities, and IDE compatibility. This side-by-side comparison highlights each tool's strengths and trade-offs.

Tools Compared: We'll consider GitHub Copilot, Cursor (AI Code Editor), Augment Code (enterprise-focused agent), Claude (Anthropic's assistant, in a coding context), and WindSurf (an audit-focused AI coding IDE). These represent a mix of individual-oriented and enterprise-oriented solutions at the forefront of AI coding assistance.

Autonomy Level

GitHub Copilot: Low. Copilot is primarily a suggestion-based tool – it completes code as you type, but it won't make changes on its own or decide to tackle tasks without being prompted. (There is an experimental "Copilot Labs" or GPT-4 powered chat that can execute simple tasks when explicitly asked, but it's very limited in autonomy.) Essentially, the developer is in the driver's seat for every change; Copilot is an intelligent autocomplete.

Cursor: High. Cursor was designed from the ground up for full task automation. In agent mode, Cursor can create new files, modify existing ones, and run code end-to-end with minimal human input. It will ask for confirmation for certain actions only if configured to do so, but otherwise it can take a prompt like "Build a simple web app that does X" and attempt to handle everything necessary (planning, coding, running) largely on its own.

Augment Code: High. Augment (geared toward enterprise) can handle multi-step projects with very little oversight. It's built to take on complex tasks end-to-end – for example, generating code, documentation, and tests for a new feature across a large codebase. Augment's agents are tuned for large-scale changes and can operate mostly autonomously once given a goal.

Claude Code (Anthropic): Moderate. Claude has some autonomy in that it can take a high-level request and produce a multi-step solution (and with its large context window it can work with a lot of code at once). However, Claude tends to be cautious and usually acts in a responsive role – it waits for the user to prompt each action. It doesn't, for instance, automatically refactor code unless asked. So while it can generate and suggest significant changes, it typically won't self-trigger major actions without user instruction.

WindSurf: High (with guardrails). WindSurf can operate autonomously on coding tasks (it can write code, modify it, run analyses, etc. without constant prompts), but it's usually configured with strict guardrails in enterprise settings. For example, WindSurf might perform multi-step code modifications but require human approval at certain checkpoints or log every action for audit. It's capable of autonomy similar to Cursor/Augment, but in practice it's often run in a semi-autonomous mode due to its focus on compliance.

Language & Framework Support

GitHub Copilot: Broad. Copilot supports dozens of programming languages – basically anything popular on GitHub, from Python to Go to Ruby to C. It has no specific framework specialization; it will try to auto-complete code in any context. Its quality is highest for well-represented languages (JavaScript/TypeScript, Python, Java, etc.) and common frameworks, simply because the underlying models (GPT-3/GPT-4) have seen a lot of such code.

Cursor: Broad. Cursor uses powerful base models like GPT-4 and GPT-3.5, so it's capable in many languages (Python, JS/TS, Java, C#, Go, you name it). It's not limited to a certain framework or language. Users have applied Cursor to everything from building web apps to doing data science scripts. Essentially, if the language is supported by the model, Cursor can work with it.

Augment Code: Broad (enterprise-focused). Augment is designed for large, professional codebases, so it focuses on the languages commonly used in enterprise environments: e.g. Java, C#, Python, C++, JavaScript/TypeScript. It also has knowledge of enterprise frameworks (like Spring Boot for Java, .NET frameworks, etc.). It shines in projects with complex architectures (microservices, distributed systems) where multiple languages/configs might be involved.

Claude Code: Broad. Claude, as a general LLM-based assistant, can handle essentially any programming language you throw at it. It's particularly good when you need reasoning about code or algorithms (e.g. explaining what a piece of code does, or writing an algorithm from scratch) – tasks where its large language understanding is useful. It doesn't have hard limitations on languages, though if you gave it something very obscure or domain-specific, it might not perform as well without fine-tuning.

WindSurf: Broad. WindSurf was designed not to be tied to one tech stack. It's used in enterprise environments where you might find a mix of front-end (JavaScript/TypeScript), back-end (Java, Python, C#), and config languages (YAML, JSON) all in one repository. Its strength is maintaining context across all these – it can work on a front-end file and a back-end file and understand the connections. Essentially, WindSurf can support any language that its underlying models and tools know, and its value comes from managing them together in a coherent way.

Codebase Navigation & Understanding

GitHub Copilot: Basic project awareness. Copilot itself doesn't have true project-wide understanding. It mainly looks at the current file and perhaps a bit of surrounding context or related files open in the editor. Copilot Chat (the chat interface in VS Code) can follow your instructions to open specific files and can maintain some context across them if you explicitly bring them into the conversation. But it won't proactively scan your whole repository or know the overall structure unless you navigate it.

Cursor: Strong. Cursor indexes the entire project workspace you have open. It will automatically retrieve relevant files when you ask a question or when a change in one file might affect another. For example, if you ask it to change a function that's called across multiple files, Cursor's agent will find all those references and potentially update them too. It's very good at "knowing" where things are in your project without you explicitly pointing it to each file.

Augment Code: Very Strong. Augment is built for huge monorepos and enterprise codebases. It uses advanced retrieval techniques (like vector databases for embeddings) to pull in relevant context from anywhere in the codebase. You can ask Augment a question about any part of the code, and it can find the answer even if it's buried in a different module or repository. If you request a change (say, renaming a function that's used in dozens of places), Augment can systematically update every occurrence across the entire codebase. This kind of holistic codebase reasoning is one of Augment's key selling points.

Claude Code: Strong in understanding, but not proactive. Claude's latest versions have a context window of up to 200k tokens, which means in theory you could feed an entire repository (or very large portions of it) into Claude for analysis. It excels at summarizing code and understanding relationships when directed to do so. However, Claude doesn't on its own traverse the codebase; a user or wrapper tool needs to provide it with the relevant files. In practice, this means Claude can be used to read and explain or modify large code sections if prompted, but it doesn't have a built-in mechanism to crawl your project like Cursor or Augment do.

WindSurf: Strong contextual memory. WindSurf maintains a persistent conversational history of the entire project state as you work with it. It's aware of the project structure and tracks changes across all files. Essentially, it builds up an internal knowledge base of your codebase. Users have noted that WindSurf feels like it "knows" the codebase intimately – you can ask, "Where is the X functionality implemented?" and it will recall the file and even the code snippet from prior interactions. This is aided by WindSurf's emphasis on logging and memory: every interaction with the code is logged and can be referenced later, which acts like an extensive memory.
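
The retrieval approach mentioned for Augment above (embed the code, then pull in the most similar pieces as context) can be sketched in a few lines. This is a toy stand-in, not Augment's implementation: it uses word-count vectors and cosine similarity in place of learned embeddings and a vector database. The `embed`, `cosine`, and `retrieve` names and the sample snippets are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': counts of lowercase word tokens (underscores split words)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, snippets: dict[str, str], top_k: int = 2) -> list[str]:
    """Return the paths of the top_k snippets most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        snippets,
        key=lambda path: cosine(query_vec, embed(snippets[path])),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical repository contents.
SNIPPETS = {
    "billing/invoice.py": "def create_invoice(customer, amount): charge customer and email invoice",
    "auth/login.py": "def login(user, password): verify password hash and start session",
    "auth/reset.py": "def reset_password(user): email a password reset link to user",
}

if __name__ == "__main__":
    print(retrieve("where is the password reset email sent?", SNIPPETS))
```

The shape is the same at any scale: turn the question and every chunk of code into vectors, rank by similarity, and hand only the top matches to the model.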

CI/CD & DevOps Integration

GitHub Copilot: Minimal. Copilot by itself doesn't integrate with continuous integration or deployment pipelines out-of-the-box. It won't, for example, automatically react to a failing build on GitHub and submit a fix. It's focused on the coding part within the editor. (GitHub is experimenting with things like Pull Request assistants that use AI, but those are separate from Copilot's core functionality.) For DevOps tasks like writing a deployment script, you'd have to prompt Copilot and it will help write the code, but it won't independently run or verify it.

Cursor: Moderate. Cursor's agent can execute shell commands as part of its operation (especially in its built-in IDE environment). This means you can have it run tests or builds as part of its workflow. For instance, you could instruct Cursor to "run the test suite and fix any failures," and it will attempt to do so – which is essentially a CI-like behavior happening locally. However, Cursor doesn't directly integrate with external CI/CD systems like Jenkins or GitHub Actions. It's more about automating those steps on your own machine. A power user could script Cursor to imitate a CI pipeline (run tests, fix, commit, etc.), but that's user-driven.

Augment Code: High integration. Augment, being enterprise-focused, is designed to slot into the development lifecycle which includes issue trackers and CI systems. It offers integrations with tools like Jira (for linking tasks or user stories to what the AI is doing) and can hook into CI pipelines. For example, Augment can be configured to automatically attempt a task when a new ticket is created, then open a Pull Request with its changes. It can also monitor CI results – if tests fail on the PR it opened, it can try to fix them. Augment's ability to commit code and open PRs by itself means it's already operating in the DevOps cycle (AI writes code, opens PR, CI runs tests on the PR, etc.). Enterprises can set it up so that Augment-triggered actions correspond to their normal development workflow, just with an AI doing some of the work.

Claude Code: Moderate. Claude isn't a full DevOps agent on its own, but it can be used in parts of that process. For instance, using Claude through an API, one could build a tool that has Claude propose code changes, then those changes go through the normal CI. Claude itself won't automatically deploy or run tests unless instructed. One concrete use is code review: Claude (via services like Claude's Slack or GitHub integration) can review a pull request and provide feedback or even code suggestions. This is part of CI/CD (the review step). But it's not known for automatically initiating deployments or interacting with CI servers without a wrapper. Think of Claude as a very smart assistant that an engineer can involve in CI/CD tasks, rather than an agent that plugs into the pipeline by itself.

WindSurf: High (audit-focused integration). WindSurf's design makes it suitable to integrate at key points in a CI/CD pipeline, especially for governance and quality checks. For example, a company could mandate that all AI-generated code changes go through WindSurf for logging and approval before being merged – effectively inserting WindSurf as a gate in the CI process. Because WindSurf logs every change and can enforce certain policies (like "if code is changed, ensure there are corresponding tests" or "flag any secrets in the code"), it can be part of an automated quality assurance step. It's less about WindSurf deploying code (it's an IDE, not a deployment tool) and more about it ensuring that whatever code the AI writes meets the organization's standards before CI/CD continues. In summary, WindSurf can be part of CI/CD in the sense that it provides the audit and compliance layer in an automated pipeline.

Debugging & Error Handling

GitHub Copilot: Reactive (on-demand). Copilot will happily suggest fixes if you prompt it with an error message or a failing test, but it won't proactively run your code or catch errors on its own. The developer needs to notice a bug and then ask Copilot (or write a comment) to get help. It's essentially an assistant for debugging, not an autonomous debugger. (For example, you might see an exception, then type a comment such as "// why is this error happening?" and Copilot might explain and suggest a fix.)

Cursor: Proactive debugging. In agent mode, Cursor can take the initiative to run code/tests and then react to the results. If an error occurs, Cursor will detect it and can immediately suggest or implement a fix in the next iteration. In practice, Cursor debugs by trial-and-error: it runs your code, sees the stack trace, decides what change might fix it, makes that change, and runs again. This loop continues until things work or it gets stuck. This means Cursor is quite good at automatically resolving straightforward bugs (especially regressions it introduced itself).

Augment Code: Comprehensive debugging. Augment can analyze complex stack traces or failure logs across a large system to pinpoint issues. Because it integrates with monitoring and logging tools (in enterprise setups), it could even be fed production error data to analyze. If Augment makes a change, it ensures tests pass by running them, and if not, it will try to fix the issue. It can also generate new tests if needed to cover a scenario. Essentially, Augment treats debugging as an integral part of its autonomous workflow – any time it writes code, it's verifying that code works and handling the errors if not. It's like having a full QA engineer and developer in one: find the bug, fix the bug, write a test for it, all done by the AI.

Claude Code: Analytical debugging. Claude excels at reasoning about problems when given information. If you provide Claude with an error message or a description of a bug, it will produce a thoughtful analysis and often a correct fix or at least a useful hint. It's like having a very experienced engineer you can consult: you still have to feed it the error and code, but once you do, its suggestion might even outshine what simpler agents do. However, Claude won't run your code to get those errors – you or a tool need to do that. It's not an automated debugger, but an excellent debugging consultant.

WindSurf: Mixed (interactive debugging). WindSurf itself can run code (since it has a built-in execution environment for analysis), but more importantly it logs everything and can reason about the state of the codebase over time. If an error pops up after some AI changes, WindSurf's logs help trace back what was changed. WindSurf's agent can then suggest a fix or revert a change. It also often enforces that any error must be resolved (it might not allow a commit if tests fail, depending on how it's configured). In essence, WindSurf's approach to errors is to catch them through thorough logging and require resolution – it's less freewheeling than Cursor (which just keeps trying), instead favoring a careful audit: "an error occurred here, these lines changed, let's analyze that." The debugging suggestions it provides are methodical, and because it's typically used in critical environments, it might stop and flag human attention if something truly unexpected happens.
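
The trial-and-error loop described for Cursor above (run, read the failure, change something, run again) has a simple shape. The sketch below is not Cursor's implementation; `run_check` and `propose_fix` are hypothetical callables standing in for "run the tests" and "ask the model for a patch", and `max_attempts` is an assumed safety limit so the loop stops when it gets stuck.

```python
from typing import Callable, Optional


def fix_loop(
    run_check: Callable[[], Optional[str]],
    propose_fix: Callable[[str], None],
    max_attempts: int = 3,
) -> bool:
    """Run a check repeatedly, applying a proposed fix after each failure.

    run_check returns None on success or an error message on failure.
    propose_fix receives that error message and applies a change.
    Returns True if the check eventually passes, False if attempts run out.
    """
    for _ in range(max_attempts):
        error = run_check()
        if error is None:
            return True
        propose_fix(error)
    # One last run so a fix applied on the final attempt still counts.
    return run_check() is None


if __name__ == "__main__":
    # Toy stand-in for a codebase: a single value that must equal 42.
    state = {"answer": 40}

    def run_check() -> Optional[str]:
        if state["answer"] == 42:
            return None
        return f"expected 42, got {state['answer']}"

    def propose_fix(error: str) -> None:
        state["answer"] += 1  # pretend the model nudges the code closer each time

    print(fix_loop(run_check, propose_fix))
```

The attempt cap is the important design choice: without it, a loop like this keeps retrying for as long as each fix produces a new failure. A more audit-oriented tool would stop at the same point and hand the error to a human instead.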

IDE/Platform Integration

GitHub Copilot: Extensive integration. Copilot is available as an extension in all major editors: VS Code, Visual Studio, JetBrains IDEs (IntelliJ, PyCharm, etc.), Neovim, and more. It's basically everywhere developers already work. This ubiquity is one of Copilot's biggest strengths – you don't need to adopt a new tool or IDE; Copilot comes to you in the environment you're comfortable with. It also integrates with GitHub's interface for suggestions in PRs and such. In short, it's very easy to adopt because it likely supports whatever you're using.

Cursor: Self-contained IDE. Cursor is its own application – an AI-enhanced IDE based on the VS Code interface. So to use Cursor, you download their editor (available on Windows, macOS, Linux). For VS Code users it feels familiar (since it's a fork of VS Code), but it's not a plugin you can drop into, say, your existing IntelliJ. You have to use Cursor's editor to get the full experience. While Cursor's editor is quite user-friendly and continually improving, some developers prefer not to switch from their established environment. (Notably, Cursor's approach means it can tightly integrate the AI features rather than being constrained by another editor's extension API.)

Augment Code: Editor plugins (VS Code and more). Augment provides a VS Code extension and has support for JetBrains IDEs in preview. This means you can use Augment inside VS Code, which is very common in enterprise teams. It also has a web dashboard for things like analytics and oversight (for managers or leads to see what the AI has done). Augment's integration isn't as ubiquitous as Copilot's (for example, there's no Neovim plugin advertised and no direct integration into every possible editor), but covering VS Code (and Codespaces) captures a large portion of developers. The strategy is to meet enterprises where they are, which is often VS Code and GitHub.

Claude Code: Multiple interfaces, less IDE-specific. Claude is primarily accessed via a chat interface (Anthropic's web app) or an API. It doesn't have an official dedicated IDE plugin (though the community has built unofficial plugins to use Claude in VS Code, etc.). Many users interact with Claude through chat – they paste code or ask for snippets, then copy the results back into their editor. This is starting to change as third-party tools integrate Claude (for example, there are browser extensions or VS Code extensions that let you use Claude with an API key). But compared to Copilot or Cursor, Claude isn't as seamlessly integrated into coding environments by default. It's more editor-agnostic: you can use it wherever via API, but it might not feel as "built-in" unless you do some setup.

WindSurf: Dedicated IDE application. WindSurf comes with its own custom integrated development environment/workspace app. It's somewhat akin to Cursor's approach in that you use WindSurf's tool as your coding interface. This IDE is tailored for code audit and collaborative AI workflows – it might look and feel a bit different from VS Code (though it reportedly mirrors many common IDE features to reduce the learning curve). There isn't a plugin to use WindSurf's AI in another IDE; the value of WindSurf is really in its whole platform (the way it tracks and manages AI interactions). Large organizations might use WindSurf as a standalone secure coding environment for their developers when working with AI assistance, rather than each developer using their own editors.

Summary of Comparison

In essence, GitHub Copilot sits at one end of the spectrum as a low-autonomy, highly-integrated assistant – it's very easy to use in any editor, but it relies on the developer to drive every change. On the other end, Cursor, Augment, and WindSurf represent high-autonomy solutions: they actively execute and manage coding tasks. Within these, Cursor is geared more toward individual power users and small teams (giving a lot of autonomy on local projects), Augment is aimed at complex enterprise codebases (providing end-to-end automation with enterprise integrations), and WindSurf focuses on governed enterprise use (high autonomy but with compliance and oversight built-in). Claude falls somewhere in the middle – it offers powerful reasoning and can handle large context, and through its API it can be integrated in various ways, but it is not as action-oriented on its own; it often needs to be paired with some platform or scripting to actually execute changes it suggests.

This comparison can help clarify which tool might fit a given scenario. For example, a freelance developer working on small projects might prefer Cursor for its strong autonomy and local control, allowing rapid prototyping without infrastructure overhead. In contrast, a large financial institution might lean toward WindSurf or Augment to harness AI assistance while maintaining strict oversight, security, and integration with their existing dev workflows. GitHub Copilot remains a great general aid for any developer who just wants smarter code completion within their familiar tools, whereas something like Claude could be an excellent "thinking partner" for complex algorithmic challenges or code review feedback.

As the field advances, we can expect these distinctions to blur – each of these tools is rapidly evolving and adding features. Copilot, for instance, may add more autonomous capabilities, and tools like Cursor are working on broader integrations. But currently, each has carved out its niche. Developers and teams can choose based on their priorities: convenience and integration (Copilot), maximal automation (Cursor/Augment), trustworthy oversight (WindSurf), or deep reasoning (Claude). By understanding these differences, one can make an informed choice or even combine these tools to cover all bases in software development going into 2025.

Enter Codex Web

Robert Matsuoka — Tue, 20 May 2025 13:15:41 GMT

OpenAI quietly rolled out Codex Web last week, gradually enabling it for pro and teams users. I spotted the new "Codex" option under my ChatGPT menu today and decided to put it through its paces on my personal website matsuoka.com (more on that project later). After several hours of testing, I've got some thoughts on what works, what doesn't, and where this fits in the increasingly crowded AI coding space.

Yet Another "Codex"

First, let's address the confusion. "Codex" now refers to at least three different products from OpenAI:

The original Codex model powering GitHub Copilot
A CLI-based coding tool I wrote about previously
This new web-based autonomous coder

The naming overlap is getting ridiculous, but I'll focus on OpenAI's new offering here. Worth noting that while Codex CLI is open source and lets you pick your own model, Codex Web uses whatever model OpenAI provides (likely o3), giving you less flexibility but more integration.

What Sets Codex Web Apart

Unlike IDE plugins (Copilot, Codeium), editor extensions (Cursor, Augment), or CLI tools (Claude Code), Codex Web is a standalone web application focused on autonomous coding. The closest comparisons would be Cognition Labs' Devin or SweepAI - tools designed to autonomously implement entire features or fixes.

Setup requires:

Device 2FA authentication
GitHub repository access (granted per repo)
Manual entry of environment variables and CI/CD scripts

Once connected, Codex scans your codebase and suggests tasks it can perform, from explaining code structure to fixing bugs it identifies.

The Surprising Part: It Actually Found Real Bugs

Scanning my personal website repo, Codex Web immediately identified several legitimate issues, including:

A stray terminal string at the end of build.js
A useToast implementation attaching state listeners on every render
An import referencing a non-existent file (simple-mock-translations vs. the correct mock-translations.ts)

These weren't just linting issues but actual functional bugs that would impact performance or break builds. I was genuinely impressed by this discovery capability.

The Execution Model: Both Powerful and Problematic

When Codex tackles a task, it spins up a virtual environment to actually execute your code. This approach has profound implications:

The Good:

It can verify changes work before committing them
Execution reveals runtime issues static analysis might miss
No worrying about breaking your local environment

The Bad:

No explicit workflow configuration or instruction support
Each task is its own separate thread, which means checkpointing is more complex. This is an intentional design choice as checkpointing is directly tied to git branches. Long term, this actually may be more natural -- but then that choice should be made clearer.
Environment setup overhead for each major operation

The Ugly:

No built-in preview option
Build cycles are excruciatingly long and may be completely unworkable for large codebases

Modern web development relies on hot module reloading and incremental builds that take milliseconds. Codex's sandbox approach turns this into minutes. The productivity gains from autonomy are partially offset by this execution overhead.

The GitHub Disconnect

Perhaps most puzzling was Codex's inability to access GitHub issues despite having repository access. When I referenced issue #44, it could not retrieve the issue, so I ended up copying and pasting the issue text manually. This reveals OpenAI's approach: Codex Web can only access what's explicitly enabled, even when it technically has the permissions. This feels like working with oven mitts on.

Where Codex Web Excels and Falls Short

After testing various scenarios, here's my assessment:

Strengths:

Code understanding and bug identification is genuinely impressive
Autonomous implementation of fixes works well
Clean interface with minimal setup friction

Weaknesses:

Build process is painfully inefficient
Limited workflow integration (issues, CI/CD)
No ability to leverage CLI tools or custom scripts
Context is limited to your code, not research or external docs

The Bigger Picture: Three Competing Visions

The AI coding landscape is splitting into three distinct approaches:

Full IDEs (either on the web or as VS Code-style desktop replacements) - Bolt, v0, Cursor, Lovable, Windsurf
CLI/Plugin Agents (Augment, Cline, Claude Code) that enhance your existing tools
PR-style agents (Codex Web, Devin) that autonomously create pull requests

OpenAI seems to be testing different approaches to AI coding - providing models for Microsoft's GitHub Copilot while building their own autonomous agent with Codex Web. But the current implementation feels caught between worlds - not seamless enough for daily coding, yet not fully autonomous enough to handle end-to-end workflows.

The Verdict: Promising but Not There Yet

Codex Web feels like a tech demo of what's possible rather than a refined developer tool. Its bug identification capabilities are impressive, and the execution model has potential, but the performance limitations and workflow gaps make it impractical as a primary development tool.

I'll keep it in my arsenal for occasional code review and maintenance tasks, but it won't replace my existing workflow anytime soon. For now, it's most valuable as a glimpse into an autonomous coding future that's still several iterations away from practical reality.

The most interesting question is whether OpenAI will address these limitations or if they're fundamental to their approach. The sandbox model provides safety and verification but creates friction that undermines the productivity promise of AI coding.

For developers already comfortable with their tools, Codex Web isn't compelling enough to switch - yet. But this space is moving incredibly fast, and I wouldn't be surprised if these limitations are addressed sooner than we expect.

What's your experience with Codex Web or other AI coding tools? I'd love to hear from others who've tried it in different contexts.

The Agentic Coding Landscape: Part 4 - Future Trends, Integrations, and Impact

Robert Matsuoka — Tue, 29 Apr 2025 03:39:20 GMT

(note: this is part of a 5-part series, part 1 is here, part 2 here, part 2a here, part 3 here, and part 4 here.)

Looking ahead to 2025 and beyond, agentic coding tools (AI systems that can autonomously generate, modify, and execute code) are poised to further transform how software is developed. Several key trends are emerging:

Multi-Agent Collaboration and AI-Only Teams: One evolution on the horizon is multiple specialized AI agents collaborating with each other (and with humans) to tackle different aspects of development. Instead of a single monolithic AI doing everything, we might have a team of AI agents each with a role – for example, a “front-end developer” agent, a “backend” agent, a “QA tester” agent, and a “project manager” agent that coordinates tasks. This concept is already being tested in research and products. Cognition’s Devin AI introduced a “MultiDevin” mode where one manager agent delegates tasks to up to 10 worker agents, effectively acting like an AI scrum team (Release Notes - Devin Docs). AI experts like Andrew Ng have noted that a multi-agent approach mirrors how human teams break down complex projects into subtasks for different roles, often yielding better results than a single agent handling everything (AI Agents With Low/No Code, Hallucinations Create Security Holes, and more | Andrew Ng | 135 comments). We could even imagine an AI stand-up meeting: agents discussing progress and handing off tasks to each other.

The implications of multi-agent systems are significant. In the not-too-distant future, an entire small software project might be executed by a set of AI agents with minimal human involvement – humans would just provide high-level goals and oversight. Agents coordinating with agents could drastically increase the scope of automation. For instance, a company could spin up an all-AI development team overnight to build a prototype, then have human engineers review and refine it the next day. This flips the paradigm to continuous development at blazing speed, with humans curating and guiding rather than writing every line.

Of course, multi-agent orchestration brings challenges too. The agents need to share context and avoid stepping on each other’s toes. One agent’s mistake could mislead another. Ensuring they all follow a unified goal and debugging an “AI team’s” decision-making are active research questions. Structured approaches are emerging to manage this complexity – for example, an AI project manager agent that maintains a global view to keep all the agents aligned. If done right, multi-agent collaboration could unlock tremendous productivity; if done poorly, it could lead to chaotic outcomes. Early tests are promising though: frameworks like Microsoft’s AutoGen and research projects like ChatDev have shown that multiple agents working together can solve problems more effectively than one agent alone (AI Agents With Low/No Code, Hallucinations Create Security Holes, and more | Andrew Ng | 135 comments).
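To make the manager-and-workers idea concrete, here is a minimal sketch of the delegation pattern described above. It is not how Devin, AutoGen, or ChatDev are implemented; the names `Task`, `Manager`, and `echo_worker` are hypothetical, and the workers are plain functions standing in for what would be LLM-backed agents. The point it illustrates is the shared context: each worker sees what earlier workers reported, and work that nobody can take is surfaced rather than dropped.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A worker takes a task description plus the notes so far and returns a report.
Worker = Callable[[str, List[str]], str]


@dataclass
class Task:
    role: str          # which kind of worker should handle it, e.g. "backend"
    description: str


@dataclass
class Manager:
    """Hypothetical coordinator: routes tasks by role and keeps a shared log."""
    workers: Dict[str, Worker]
    shared_context: List[str] = field(default_factory=list)

    def run(self, tasks: List[Task]) -> List[str]:
        for task in tasks:
            worker = self.workers.get(task.role)
            if worker is None:
                # Surface unassignable work instead of silently dropping it.
                self.shared_context.append(
                    f"UNASSIGNED[{task.role}]: {task.description}"
                )
                continue
            # Pass a copy so a worker cannot rewrite the shared history.
            report = worker(task.description, list(self.shared_context))
            self.shared_context.append(f"{task.role}: {report}")
        return self.shared_context


def echo_worker(name: str) -> Worker:
    """Stand-in worker (assumption): a real one would call a model here."""
    def work(description: str, context: List[str]) -> str:
        return f"{name} did '{description}' with {len(context)} prior notes"
    return work
```

A real system would add retries, conflict detection between workers, and a way for the manager to re-plan when a worker reports failure, which is exactly where the open research questions sit.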

Deeper Integration with DevOps and Production Systems: Agentic coding tools will increasingly integrate across the entire software development lifecycle, including DevOps pipelines and live production systems. We’re already seeing early signs of this:

CI/CD Autonomation: Continuous Integration services (like GitHub Actions) are starting to experiment with AI agents that automatically fix build failures or optimize pipelines. For example, if a CI test fails at 2 AM, an AI agent could detect the failure and open a pull request with a potential fix before the team wakes up. Microsoft recently teased an “Agent Mode” for GitHub Copilot that can autonomously iterate on code and address errors in a CI pipeline (GitHub Copilot introduces Agent Mode, teases its first autonomous ...).
Automated Deployment & Ops: Future AI agents might handle deployments and monitor applications in production. If an issue arises (say a spike in error rates or a performance dip), the AI could automatically roll back the deployment or even live-patch the running system. Imagine an agent noticing a memory leak in a microservice: it generates a fix, hot-patches the service, and writes a report of what it did – all without human intervention. This scenario extends the role of Site Reliability Engineering (SRE) to AI. We could see an “AI ops team” where one agent detects an anomaly, another diagnoses the cause, a third applies a fix, and a fourth communicates the update to humans.
Continuous Optimization: Beyond reacting to problems, AI agents in production could continuously refactor code and tune systems for better performance. For instance, an agent might periodically refactor code for efficiency or adjust cloud infrastructure (like scaling servers or tweaking database indexes) based on usage patterns. Early hints of this are in industry chatter – experts predict AI agents will soon handle tasks like “monitoring load balancers and fixing CI/CD pipelines autonomously,” reducing a lot of DevOps toil.
DevSecOps and Compliance: Security and compliance checks can be enhanced by AI. Agents could scan new code for vulnerabilities or insecure patterns (like an automated security reviewer), and even fix them before merge. They could enforce compliance rules – for example, ensuring dependencies meet licensing requirements or that code handling user data conforms to GDPR guidelines – all as part of the pipeline.
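The "spike in error rates, then roll back" scenario from the list above boils down to a decision rule that an ops agent would have to apply before touching production. Here is a rough sketch of one such rule. The function name, the thresholds, and the idea of comparing averaged error rates against a pre-deploy baseline are all my assumptions for illustration, not something taken from a specific product.

```python
from statistics import mean
from typing import Sequence


def should_roll_back(
    baseline_error_rates: Sequence[float],
    current_error_rates: Sequence[float],
    max_ratio: float = 2.0,     # assumed tolerance: 2x baseline triggers rollback
    min_samples: int = 3,       # assumed: wait for a few readings before acting
    floor: float = 0.001,       # avoids overreacting when baseline is near zero
) -> bool:
    """Return True when post-deploy errors look clearly worse than baseline."""
    if len(current_error_rates) < min_samples or not baseline_error_rates:
        return False  # not enough evidence yet; keep observing
    baseline = max(mean(baseline_error_rates), floor)
    return mean(current_error_rates) > max_ratio * baseline
```

Keeping the rule this explicit is a deliberate choice: whatever the agent does next (rollback, hot patch, report), a human reviewing the incident can see exactly why it acted.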

As these tools integrate with DevOps, developers may shift into more of a supervisory role for the pipeline. The future might involve developers specifying high-level objectives (“deploy this service with 99.99% uptime and minimal cost”), and the AI pipeline figuring out how to achieve it by adjusting configs, running tests, optimizing code, etc. OpenAI’s recent release of Codex CLI (an AI coding agent that runs locally in the terminal) hints at this agentic ops capability – it introduced distinct “approval modes” that let developers decide how autonomously the agent can act (from just suggesting changes to fully executing commands) (OpenAI Codex CLI – Getting Started | OpenAI Help Center). In fact, Codex CLI can not only write and edit code but also run shell commands to test it, all within a sandbox on your machine. This kind of tool shows how an AI agent can operate the same interfaces a developer would – be it a CLI, a cloud dashboard, or a web UI – effectively acting as a virtual DevOps engineer. (OpenAI calls Codex CLI a “lightweight coding agent” that can read, write, and execute code on your behalf.)
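The approval-mode idea is easy to model as a small policy table. The sketch below is my own illustration of the concept, not Codex CLI's code: the mode labels and the `needs_human_approval` function are assumptions chosen to match the range described above, from "just suggesting" to "fully executing".

```python
from enum import Enum


class ApprovalMode(Enum):
    # Illustrative labels; check the tool's own docs for its actual options.
    SUGGEST = "suggest"      # agent proposes, human approves every change
    AUTO_EDIT = "auto-edit"  # agent may edit files, commands need approval
    FULL_AUTO = "full-auto"  # agent may edit files and run commands


class ActionKind(Enum):
    READ_FILE = "read"
    EDIT_FILE = "edit"
    RUN_COMMAND = "run"


def needs_human_approval(mode: ApprovalMode, action: ActionKind) -> bool:
    """Decide whether the agent must pause and ask before acting."""
    if action is ActionKind.READ_FILE:
        return False  # reading is treated as always safe in this sketch
    if mode is ApprovalMode.FULL_AUTO:
        return False
    if mode is ApprovalMode.AUTO_EDIT:
        return action is ActionKind.RUN_COMMAND
    return True  # SUGGEST: every edit and command goes through a human
```

The useful property is that autonomy becomes a single dial the developer sets per session, rather than something baked into the agent.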

Advances in AI Models and Capabilities

The underlying AI models are rapidly improving, which directly expands what agentic tools can do:

Smarter, More Accurate Agents

New large models (like the successors to GPT-4 or Anthropic’s Claude 2) are reducing coding errors and better understanding intent. The percentage of tasks these agents can complete end-to-end is rising. For example, Cognition’s Devin AI in early 2024 could autonomously resolve about 14% of coding issues in a benchmark test (Cognition emerges from stealth to launch AI engineer Devin | VentureBeat). In comparison, other AI models at the time could only solve under 5% (with GPT-4 around 1.7%). This gap is illustrated in the chart below – Devin significantly outperformed Claude 2, an open-source LLaMA model, and even GPT-4 in that test:

(image) Autonomous issue resolution rates in SWE-Bench (open-source coding challenges) – Devin far outpaced other AI models (Cognition emerges from stealth to launch AI engineer Devin | VentureBeat).

As models improve, those success rates will likely climb higher. It’s plausible that within a couple of model generations, AI agents might handle 50% or more of routine coding tasks without human help. In other words, many tasks that still require a person in the loop today could soon be within the AI’s solo capability. The “edge of capability” is constantly moving outward – one security researcher described it as the point where you don’t realize the agent is out of its depth until it fails, and that edge keeps getting pushed further with each improvement. Soon, AI might reliably handle all well-understood programming tasks, leaving only truly novel or complex architectural challenges to human engineers.

Larger Context Windows

Models like Anthropic’s Claude now support extremely large context windows (100k tokens, and future models may handle millions). This means an agent can “load” an entire codebase, plus related documentation and requirements, into context at once. The AI can have a holistic understanding of a software project. For developers, this is like having an assistant who has read your project’s every line of code and design doc. It enables questions like “If we implement feature X, what parts of the system will be affected?” to be answered in detail by the AI. Essentially, it brings us closer to an AI that can act like a software architect, considering system-wide implications of changes.
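Whether a codebase actually fits in a 100k-token window is simple arithmetic, and worth checking before assuming the agent "has read everything." Here is a back-of-the-envelope sketch. The four-characters-per-token ratio is a rough rule of thumb I am assuming, not a real tokenizer, and the function names and the reserve for the model's answer are hypothetical.

```python
from typing import Dict

CHARS_PER_TOKEN = 4  # rough heuristic (assumption); real tokenizers vary by content


def estimate_tokens(text: str) -> int:
    """Approximate token count by rounding up len(text) / CHARS_PER_TOKEN."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def fits_in_context(
    files: Dict[str, str],            # path -> file contents
    context_tokens: int = 100_000,    # window size mentioned above
    reserve_for_answer: int = 4_000,  # assumed headroom for the model's reply
) -> bool:
    """Check whether all files plus answer headroom fit in one context window."""
    total = sum(estimate_tokens(body) for body in files.values())
    return total + reserve_for_answer <= context_tokens
```

By this estimate, 100k tokens is roughly 400,000 characters of source, so a small service fits whole while a large monolith still needs retrieval or chunking.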

Multimodal Understanding

Future coding agents will incorporate more than just text. We already see hints: OpenAI’s Codex CLI can accept screenshots or diagrams as input to inform its coding. We might soon have agents that take in design mockups (e.g. a drawn wireframe) and produce the corresponding UI code, or agents that watch a video of a user interacting with an app and then identify and fix the UX issues. By understanding not just code, but images, UI layouts, or even audio, an AI developer could bridge the gap between design and implementation. For example, a designer could sketch an app interface, and an AI agent would generate the React/Vue code for it. Or an AI could observe that users are clicking a certain button repeatedly and deduce there’s a UX pain point, then suggest a code change. Multimodal agents could tighten the feedback loop in development by linking user experience directly to code changes.

Better Natural Language Understanding

As models get better at understanding nuance, we can give higher-level instructions. Instead of painstakingly specifying how to implement something, we might simply tell the AI, “Make this module more secure,” or “I need this service to handle 10× the current load.” The agent would then figure out a plan: e.g. add encryption here, more input validation there, or introduce caching and more efficient algorithms to scale. Communication with AI will become more like talking to a very competent senior developer – you focus on what you want, they figure out how to do it. This opens the door for people who aren’t expert programmers (product managers, domain experts, etc.) to directly guide development via natural language – albeit with oversight to ensure the AI’s approach is sound.

Long-Term Learning and Memory

Today’s agents have limited memory of past sessions. But we can expect agents to develop longer-term memory of a project’s history. For instance, an AI coding assistant might remember decisions made weeks or months ago: “We tried approach A for this feature last quarter and it didn’t work well, so let’s not repeat that mistake.” With techniques like fine-tuning on project-specific interactions or new memory architectures, an AI could become a persistent team member that grows alongside the project. This kind of continuity would make the agent more useful over time, as it accumulates context and lessons just like a human team member would.

Impact on Software Engineering Practices

The rise of these autonomous coding tools will likely reshape software engineering practices and team dynamics:

Shift to Specification and Review: If AI handles more of the coding, human developers may spend more time on writing precise specifications, test cases, and doing code reviews. We could see a development process that is “spec-first.” In an ideal workflow, developers write thorough design docs and tests up front (as is encouraged by methodologies like TDD), then the AI implements the code to satisfy those specs and tests. Humans would then review the AI’s code diffs and the test results, focusing on whether the intent was correctly realized. In essence, developers move up a level of abstraction: describing the “what” and validating the “what,” with the AI handling the “how.” This is like Test-Driven Development on steroids, with the AI filling in the implementation details.
Continuous Development (AI Pair Programming 2.0): We already have continuous integration and deployment (CI/CD); soon we might have continuous implementation. Codebases could be in a state of constant incremental improvement by AI agents. Some have called this the era of self-healing code or self-optimizing systems. The AI could continuously refactor and improve the codebase in the background, never stopping unless told to. Engineers would supervise this ongoing evolution, setting high-level objectives and constraints (“don’t break backward compatibility,” “optimize for latency under 100ms,” etc.). The result might be software that’s always getting a little better each day, without formal “sprints” for refactoring – the AI is always refactoring in micro doses.
New Tools and Ecosystems: We will likely see more specialized agentic tools tailored to particular domains. For example, one agent might specialize in database schema migrations, another in mobile app development, and another in writing unit tests. Major development platforms are also integrating these capabilities natively. Microsoft has hinted at deeper AI features coming to Visual Studio and Azure DevOps. GitLab is adding AI-assisted code reviews and pipeline management. The ecosystem will evolve such that coding without AI feels like working with one hand tied behind your back. We may also see marketplaces or repositories for AI agent “skills” or plugins – e.g. an agent plugin that knows how to optimize SQL queries, which you can add to your coding agent’s repertoire.
Human–AI Collaboration Norms: The role of “AI pair programmer” could become formalized. We might have teams where each human developer is paired with an AI agent that knows their codebase and preferences. Daily stand-ups might include AI agents reporting on what they accomplished last night. (This is not as far-fetched as it sounds – some tools already auto-generate daily summaries of code changes; extending that to a spoken report via an AI persona is conceivable.) Culturally, teams will need to treat AI agents as part of the team. There are even reports of open-source projects listing AI contributions in their changelogs or acknowledging AI assistants as contributors. Questions like “Who is responsible for an AI’s code contribution?” and “Do we credit the AI?” will need clear policies.
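The spec-first workflow described in this list can be sketched in a few lines: the human writes the spec as executable cases, and a candidate implementation is accepted only if it satisfies all of them. Everything here is illustrative. The `clamp` example, the two candidates, and the helper names are hypothetical stand-ins for what would be AI-generated code in practice.

```python
from typing import Callable, Iterable, Optional, Sequence, Tuple

# One spec case: ((value, low, high), expected result)
Case = Tuple[Tuple[int, int, int], int]
Impl = Callable[[int, int, int], int]

# Human-written spec for a clamp(value, low, high) function.
CLAMP_SPEC: Sequence[Case] = (
    ((5, 0, 10), 5),    # inside the range: unchanged
    ((-3, 0, 10), 0),   # below the range: raised to low
    ((42, 0, 10), 10),  # above the range: lowered to high
)


def satisfies(impl: Impl, spec: Sequence[Case]) -> bool:
    """True only if the implementation matches every case without raising."""
    for args, expected in spec:
        try:
            if impl(*args) != expected:
                return False
        except Exception:
            return False
    return True


def accept_first(candidates: Iterable[Impl], spec: Sequence[Case]) -> Optional[Impl]:
    """Return the first candidate that passes the whole spec, else None."""
    for impl in candidates:
        if satisfies(impl, spec):
            return impl
    return None


def candidate_buggy(value: int, low: int, high: int) -> int:
    return max(low, value)  # forgets the upper bound


def candidate_ok(value: int, low: int, high: int) -> int:
    return min(high, max(low, value))
```

Passing the spec is necessary, not sufficient: the human review step still has to ask whether the spec captured the intent in the first place.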

Potential Challenges and Unknowns

While the outlook is exciting, there are important challenges and open questions to address:

Quality and Trust: Will we reach a point where AI-produced code is as trusted as human-produced code, especially for critical systems? Perhaps, but any high-profile failures (e.g. an AI bug causing a major outage or security breach) could slow down adoption. Likely there will remain an upper bound of risk tolerance – for safety-critical software (medical devices, aerospace, etc.), humans may insist on final sign-off or stringent validation for the foreseeable future. AI or not, accountability for errors will still lie with humans, so processes to verify AI output (tests, audits, formal methods) will be crucial.
Regulatory and Legal Questions: If an AI agent writes a substantial portion of code, who owns the copyright? Who is liable for defects? These legal questions are still being figured out. Regulators might introduce requirements for audit trails of AI contributions in industries like finance, healthcare, or automotive. We might even see certifications emerge for AI coding tools – imagine an FDA-style approval process for an AI developer to be used in medical software. Additionally, companies will need policies on things like AI-generated code containing open-source snippets (to avoid license violations), or ensuring AI doesn’t introduce insecure code.
Developer Job Market and Skills: As routine coding gets automated, the skills required for developers will shift. Entry-level software jobs might diminish or change in nature – new graduates may be expected to know how to effectively use AI tools rather than write every line themselves. There’s a positive spin: automating drudge work could free developers to focus on design, strategy, and creative problem-solving, making the job more enjoyable. But there’s a cautionary view: developers will need to continually upskill to stay ahead of what AI can do. Knowing how to prompt, guide, and verify AI might become as important as knowing how to code a particular algorithm by hand. Education programs are already starting to incorporate AI tool training into curricula.

In general, the industry sentiment so far is optimistic. AI coding agents are seen as powerful aids that can boost developer productivity and output. In a sense, they fulfill the long-held dream of higher-level programming: humans describe what they need in natural language, and the machine figures out the detailed steps to make it happen. We went from assembly language to high-level languages to no-code builders – and now to conversational AI development. Each step has abstracted away more low-level work. Agentic AI is the next step in that evolution.

Experts predict that by the late 2020s it will be standard for every developer to work alongside an AI agent, much like it’s now standard to use version control or CI/CD pipelines. Once everyone has an AI co-developer, the playing field may level out again in terms of productivity (much like how having a laptop became a baseline – now everyone has one, so it’s not a competitive advantage). But in the interim, early adopters can have a huge edge.

Interestingly, these tools might enable entirely new kinds of software that were previously impractical. When AI agents can simulate entire user populations, generate dozens of variations of a feature, or automatically improve code over time, development starts to look a bit like an evolutionary process. For example, one could have an AI generate 50 different implementations of a feature, deploy each to a subset of users, learn from the results, and then combine the best approaches – a level of experimentation and optimization that no human team could practically do. Software could evolve more organically, with AI driving rapid iteration and selection of the fittest solutions.
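The "generate many variants, deploy each to a subset of users, keep the best" loop has two mechanical pieces: assigning users to variants and picking a winner from the results. Below is a deliberately simple sketch of both. The function names, the modulo bucketing, and the minimum-sample rule are my assumptions, and the selection step compares raw success rates without any statistical significance test.

```python
from typing import Dict, List, Optional, Sequence


def assign_variant(user_id: int, variants: Sequence[str]) -> str:
    """Deterministically bucket a user so they always see the same variant."""
    return variants[user_id % len(variants)]


def best_variant(
    outcomes: Dict[str, List[bool]],  # variant name -> per-user success flags
    min_users: int = 100,             # assumed: ignore variants with too little data
) -> Optional[str]:
    """Pick the variant with the highest success rate among well-sampled ones."""
    rates = {
        name: sum(results) / len(results)
        for name, results in outcomes.items()
        if len(results) >= min_users
    }
    if not rates:
        return None  # nothing has enough data to judge yet
    return max(rates, key=rates.get)
```

With 50 variants the traffic per variant gets thin quickly, so in practice the minimum-sample guard (or a proper significance test) matters more than the selection line itself.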

In summary, the future of agentic coding tools points to a world where developers act more as architects, curators, and strategists, while AI agents handle much of the grunt work (and even some creative work) under human guidance. Software will be developed faster, more iteratively, and potentially with higher quality – if managed well. Collaboration will extend to AI teammates (and even multiple AI agents working together under our oversight). The boundaries of what a small team can build will expand dramatically, possibly ushering in an era of hyper-personalized software (lots of niche solutions that an AI can spin up on demand) and unprecedented innovation speed.

As one expert succinctly put it: “The role of engineers isn’t to fear AI, but to lead its integration into the systems we build.” (The Rise of AI Agents: How 2025 Will Transform Software Engineering) Those who embrace and skillfully guide these tools will define the next chapter of software engineering.

Visual Summary Table

Best Agentic Coding Tools by Target Audience / Use Case

User / Team Segment | Recommended Tools | Rationale / Key Benefits

Segment: Individual Developers & Freelancers
Recommended tools: GitHub Copilot, Cursor (and Cline/Roo for power users)

Versatility & Productivity: Copilot integrates seamlessly into editors and provides instant code suggestions for many languages – great for quick wins in daily coding. Cursor (and similar IDE agents) goes further by automating end-to-end tasks in a project (running tests, creating files, refactoring) which is ideal for a solo dev wearing many hats. Power users who want maximal control or customization might explore Cline/Roo, as these offer deep automation that can be self-hosted and tweaked. Overall, these tools handle routine code generation so an individual can focus on more complex or creative work, effectively acting as an extra set of hands.

Segment: Startup Teams (2–10 devs)
Recommended tools: Devin, Cursor, Replit Ghostwriter

Ambitious Autonomy: Startups need to move fast and build MVPs quickly. Devin’s autonomous project-building capabilities can bootstrap entire app components from just a prompt – potentially creating a rough MVP overnight. Cursor boosts each team member’s throughput (almost like adding extra junior developers who can take on grunt work), which is priceless for a small team. Replit Ghostwriter is great for startups, especially those using Replit for collaboration – it offers real-time AI help in a shared dev environment with minimal ops overhead. By using these tools, a startup could accomplish in days what might otherwise take weeks, giving them a competitive edge in iterating on product ideas.

Segment: Enterprise Development Teams
Recommended tools: Augment Code, Claude (Claude 2), Windsurf

Codebase Mastery & Compliance: Large organizations deal with huge, complex codebases and strict processes. Augment Code shines in such environments with its deep codebase indexing and integrations (Jira, CI/CD), helping teams query and refactor large monoliths efficiently. Claude’s 100k context and advanced reasoning make it excellent for understanding big-picture architecture and generating documentation or test plans – it can act like an AI architect or systems analyst in the loop. Windsurf appeals to enterprises for its focus on auditability and governance; it provides an AI coding assistant with an “audit trail” of decisions, aligning with compliance needs. Together, these tools accelerate development in large systems while maintaining the oversight and reliability that enterprises require.

Segment: Frontend/Web Developers
Recommended tools: v0 (Vercel), Bolt.new, Lovable

Rapid UI/UX Implementation: Frontend and web app devs benefit from tools that understand web frameworks and user experience. Vercel’s v0 (when available) is tailored for Next.js and modern React workflows – it can generate components and pages in the idiomatic way a Next.js developer would, speeding up UI development. Bolt.new provides a zero-setup, in-browser IDE powered by StackBlitz, enabling quick full-stack prototypes; a frontend dev can instantly spin up a new project with code generated for both client and server, and see it live. Lovable is almost a no-code tool – a frontend or product designer can describe an app in plain English and get a functional prototype with UI and backend; for a developer, it’s a way to get a first draft of the front-end and logic, which they can then refine. All of these reduce the boilerplate and setup time for web apps, letting developers focus on polishing user experience.

Data Science & Analysts (Coding for data/ML)

Claude, GitHub Copilot (Chat mode), Amazon Q

Assisted Scripting & Analysis: Data scientists often write scripts in Python/R or SQL queries – tasks well-suited for AI help. Copilot (especially in chat mode or Jupyter integration) can autocomplete code and suggest fixes as they explore data, which speeds up analysis. Claude’s large context window allows it to ingest entire datasets’ schema or long logs, and then answer questions or generate analysis code based on all that context – useful for debugging data pipelines or interpreting results. If working in the AWS ecosystem, Amazon Q (Amazon’s AI coding assistant within AWS Studio) is tailored for building data and ML workflows on AWS (Generative AI Assistant for Software Development – Amazon Q ...) (Accelerate analytics and AI innovation with the next generation of ...). It can help generate infrastructure-as-code (CloudFormation/Terraform) or Glue/SageMaker scripts by understanding AWS-specific contexts. These tools act like AI pair programmers for data work, letting analysts spend more time on interpreting results rather than writing boilerplate code for data cleaning or charting.

Low-Code / No-Code Creators

Lovable, Bolt.new

Natural Language Development: For entrepreneurs or designers with minimal coding experience, these tools enable app creation through plain language. Lovable turns plain English descriptions into full-stack web apps – it handles the database, backend, and frontend automatically. This lowers the barrier for non-engineers to bring an idea to life. Bolt.new, while requiring a bit more web familiarity than Lovable, allows trying out ideas without setting up a dev environment – everything runs in the browser with AI assistance. Together, they empower creators to realize software ideas without hiring a developer, by letting AI handle the heavy lifting of code. The key benefit is the ability to prototype and validate an idea quickly and inexpensively.

DevOps & SRE Engineers

OpenAI Codex CLI, Windsurf, Amazon Q

Automation of Ops Tasks: DevOps teams can leverage these tools to automate routine operational work. OpenAI’s Codex CLI is a powerful aid – it can generate shell scripts, Dockerfiles, CI configs, etc., and execute them in a controlled local environment ([OpenAI Codex CLI – Getting Started

Education (Students & Teachers)

GitHub Copilot, Replit Ghostwriter

Learning Aid & Sandbox: For students learning to code, an AI assistant can be immensely helpful. GitHub Copilot (available free to students and educators) provides on-the-fly code suggestions that can help overcome syntax struggles and suggest how to implement functions, acting like an interactive tutor (Github Copilot is free for maintainers of popular open source projects) (GitHub Copilot now available for teachers). It’s useful for learning by example (though students must be careful to understand suggestions, not just copy blindly). Replit Ghostwriter offers a safe sandbox environment: a student can code in the browser and get immediate AI help. Ghostwriter can explain errors and even auto-fix code, almost like a teaching assistant that’s available 24/7. Teachers have used these tools in the classroom to demonstrate concepts – for example, asking the AI to solve a problem and then discussing its solution and mistakes. The goal here is accelerating the feedback loop in learning: students get un-stuck faster, and can experiment more freely, while teachers guide the conceptual understanding and proper use of AI.

These tables provide a quick guide to which tools shine for which scenarios. Of course, many tools are versatile and can serve across multiple categories, and many developers will use a combination of them. The landscape is rich, and choosing the right tool depends on the specific context, goals, and constraints of the user or team.

As the field continues to progress, expect these comparisons to evolve – new tools will emerge, pricing will shift (likely becoming more competitive or usage-based), and certain tools will broaden their target audiences. The key is to stay informed about each tool’s capabilities and to pilot them to see which aligns best with your needs.

Conclusion

Agentic coding tools represent a fundamental shift in software development. They have the potential to dramatically increase productivity while also changing the day-to-day workflow of developers. Our expanded survey covered not only major players defining this space, but also emerging tools pushing the boundaries (like Cognition’s Devin and OpenAI’s Codex CLI), integration with DevOps, impacts on team roles, pricing models, and ethical considerations.

A few key takeaways:

Productivity Gains: When used effectively, these tools can yield significant productivity improvements (some early case studies report 2×–5× faster completion of certain tasks). Entire features or projects can be completed in a fraction of the time they used to take. This amplifies what individual developers and small teams can achieve, allowing even startups or solo devs to tackle bigger problems or more projects in parallel.
Changing Developer Roles: Rather than replacing developers, AI agents are augmenting developers and shifting the nature of their work. The emphasis is moving toward higher-level thinking: planning, architectural decisions, testing and validation, and orchestration of components. Developers are increasingly in a supervisory or “editor-in-chief” role – they guide the AI with prompts and specifications, then review and refine the outputs. Mastery of “prompt engineering” (i.e. communicating intent to the AI) and configuring these tools is becoming a valuable skill, akin to knowing your development environment and libraries.
Tool Differentiation: There is no one-size-fits-all AI coding tool. The tools differ in focus and ideal use cases. For example, Cursor acts as an autonomous pair programmer deeply integrated in an IDE, whereas GitHub Copilot is a lightweight suggestion engine for code completions. Augment focuses on repository-wide intelligence and enterprise integration, while Windsurf prioritizes auditability and safety for enterprise use. Understanding these differences is crucial to selecting the right tool for a given project or team culture. In practice, many teams use a combo (e.g. Copilot for inline suggestions and an agent like Devin or Cursor for larger tasks).
Real-World Validation: We are starting to see real-world studies and reports validating the impact of these tools. For instance, a case study at a large automotive company found that using GitHub Copilot increased developer throughput and code quality on certain tasks ((PDF) The impact of GitHub Copilot on developer productivity from a ...). Many startups have publicly shared that they could iterate and pivot faster thanks to AI pair programmers. The flood of investment into this sector (hundreds of millions in funding, and some big acquisitions rumored) underscores that industry leaders believe agentic coding is not a fad but a core part of the future of software development.
Risks and Mitigations: Alongside enthusiasm, there are genuine concerns. Security issues like prompt injection (where an AI can be tricked into executing malicious instructions) and leaking sensitive code/data are important to guard against (Secure Code Warrior on X: "Are you aware of the security risks ...). Quality issues like subtle bugs or AI “hallucinations” (making up code that doesn’t actually work) require that developers stay vigilant with testing and code reviews. Ethical concerns, including bias in AI suggestions or the impact on developer jobs, need proactive management. Mitigating these risks will require a combination of: tool features (for example, sandboxing execution, or providing transparency into the AI’s decision process), best practices on the team (code review remains essential, as does thorough testing), and possibly new organizational policies or guidelines for AI usage. When those safeguards are in place, the benefits of speed and assistance tend to far outweigh the downsides – and surveys show a large majority of developers are excited about having AI help (one survey by Salesforce found 96% of developers expect AI agents to improve their workflow (AI Agent Adoption in Software Development: A Reality Check | by Herbert Moroni Gois | Mar, 2025 | Medium)).
Future Outlook: The momentum suggests we are only at the early stages. As models improve and tools integrate more deeply, we might see a paradigm where “coding” is less about typing syntax and more about orchestrating intelligent agents. This doesn’t diminish the importance of human creativity, intuition, and judgment – if anything, it elevates those aspects. Future developers might spend less time debugging null pointer exceptions and more time brainstorming features or designing user experiences, with the AI handling the mundane bits. Coding could become a higher-level creative collaboration between humans and AI, which is a very different picture from the solo coder agonizing over boilerplate late into the night.

The relationship between developers and code is being redefined. Agentic AI is turning what used to be a labor-intensive process into a highly automated, interactive, and even creative partnership between human and machine. Many companies adopting these tools describe them as “teammates” or “co-pilots” – albeit ones that need guidance and oversight. The overwhelmingly positive reception from developers so far suggests that, far from feeling threatened, most see these AI agents as empowering tools that free them from drudgery and let them focus on the more fulfilling aspects of building software.

Organizations that effectively integrate these tools stand to gain significant competitive advantages through accelerated development cycles, improved code quality, and the ability to tackle ambitious projects with leaner teams. Meanwhile, developers who embrace and learn to harness agentic tools are likely to advance faster in their careers, as they can deliver more value and adapt quickly to the evolving technical landscape.

Software development has always been about extending human capability through better abstractions and tools. Agentic coding is the next giant leap in that evolution – blending artificial intelligence with human ingenuity. By staying informed, practicing good oversight, and continuously learning, we can ensure this leap leads to software that is more robust, innovative, and developed at a pace once unimaginable.

Enter Codex CLI

Robert Matsuoka — Wed, 23 Apr 2025 19:03:22 GMT

I came across OpenAI's Codex CLI while researching a broader State of the Union on agentic coding—a topic that’s about to get a lot more public in the next few days. Codex wasn’t the headline of that work, but it stuck with me.

Quick note: there are multiple tools named "Codex." This piece is about OpenAI's Codex CLI, released in April 2025. It's a standalone, open-source command-line assistant with support for GPT-4 and other models. It is not the same as the earlier OpenAI Codex model (2021–2023, used in GitHub Copilot) or the community-built Codex CLI projects like the one by codingmoh that runs fully offline¹.

It dropped last week without much noise. No launch spectacle, no hype cycle. But it’s quietly one of the most thoughtfully engineered AI tools I’ve used this year.

Not because it’s novel. Because it’s practical.

Four Things Codex Gets Right

1. Model Choice With Real Leverage

Codex gives you access to multiple OpenAI models:

gpt-4-o-mini: cheap, fast, and surprisingly conversational
gpt-4.1: best-in-class on coding tasks, and more cost-effective than Claude 3.5 in real-world use

Having both means you can balance cost and capability based on what you’re doing—debugging a script, reviewing code, or generating config boilerplate.

But here’s the kicker: Codex CLI is open source and supports any provider.

Want to use Claude, Gemini, or local models via Ollama, LM Studio, or custom APIs? You can. The system is designed to be provider-agnostic. Model definitions live in your .codexrc file, and you can route prompts through whatever stack suits your workflow—whether that’s cloud, self-hosted, or air-gapped environments.

That flexibility turns Codex from a tool into a platform.

And if you're looking to save on usage, Codex also supports a --flex-mode flag. When enabled, it attempts to use lower-cost or opportunistically available compute—but with a tradeoff: it may occasionally interrupt long-running sessions if those resources become unavailable. Great for budget-conscious devs, but not yet fully production-safe.

A hint to OpenAI: --flex-mode is a clever idea. But when compute runs out, sessions shouldn’t just crash. Either exit gracefully, or pause until capacity is restored.
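To make that hint concrete, here is a minimal Python sketch of the "pause until capacity is restored" behavior. It is not how Codex CLI works internally: CapacityUnavailable, run_with_pause, and the retry settings are hypothetical names I made up for illustration.

```python
import time


class CapacityUnavailable(Exception):
    """Hypothetical error raised when flex compute is withdrawn."""


def run_with_pause(step, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Run step(); on CapacityUnavailable, wait with exponential backoff and retry.

    Returns ("done", result) on success, or ("paused", None) once retries run
    out, so the caller can save state and exit cleanly instead of crashing.
    """
    for attempt in range(max_retries + 1):
        try:
            return "done", step()
        except CapacityUnavailable:
            if attempt == max_retries:
                break
            # Delays grow as base_delay * 1, 2, 4, ... between attempts.
            sleep(base_delay * 2 ** attempt)
    return "paused", None
```

The point of the two-part return value is that running out of capacity becomes a state the session can report and resume from, not an unhandled failure.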

2. Big Context, Smart Scope

Codex reads your full codebase but doesn't send it all at once. It scopes intelligently, caches what it can, and only sends what’s relevant. This cuts down noise, improves response time, and avoids leaking sensitive data.

It behaves like someone who’s read your repo and knows what to ignore.
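One way to picture that kind of scoping is a simple relevance filter. The sketch below is an assumption for illustration, not Codex's actual algorithm: select_context, its keyword-overlap scoring, and the character budget are all hypothetical.

```python
import re


def select_context(prompt, files, budget=2000):
    """Pick the files most relevant to the prompt within a character budget.

    files maps a path to its text. Relevance here is just the number of
    words shared between the prompt and the file's path plus contents.
    """
    words = set(re.findall(r"[a-z_]+", prompt.lower()))
    scored = []
    for path, text in files.items():
        tokens = set(re.findall(r"[a-z_]+", (path + " " + text).lower()))
        score = len(words & tokens)
        if score:  # files with no overlap are never sent
            scored.append((score, path))
    # Highest score first; ties broken by path for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    chosen, used = [], 0
    for _, path in scored:
        size = len(files[path])
        if used + size <= budget:
            chosen.append(path)
            used += size
    return chosen
```

A real tool would presumably use something smarter than word overlap, but the shape is the same: score everything locally, send only what clears the bar.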

(It actually can read your entire codebase and act fully autonomously, à la Devin, but I haven’t had the courage to try that mode yet. The "dangerously-auto-approve-everything" and "full-context" modes are both available.)

3. Privacy That’s Built In, Not Bolted On

Local-first execution
Explicit file access
No telemetry or background logging by default

If you work in regulated or high-trust environments, these aren’t features—they’re prerequisites.

4. Thoughtful UX, Conversational Flow

Codex is non-blocking. Ask a question, keep working, get a response later. No lock-in. No modal dead ends.

And when you authenticate with OpenAI, it exposes its thought chain: a step-by-step breakdown of how it interpreted your request. Here’s what o4-mini said while resolving a style update:

"The user asked to ‘apply the same style to the playground pages,’ which likely means using the card style from the landing page. Since they didn’t mention adding breadcrumbs, I think it’s best to leave the child pages as they are... I confirmed the necessary Card components are imported... I’ll also remove any trailing spaces before committing the changes. Now it’s ready to go!"

That’s not just output—that’s understanding… or at least a better simulation of it.

Here’s another great example that showcases how it reasons through constraints and proposes a clear plan of action:

"I’m thinking it’s best to propose a revised plan: let’s roll back and start with something smaller instead. It seems like the changes to session.ts are too large, so I’ll check version control for the previous state because we don’t have that code locally. Given the constraints, I think the best course is to recommend to the user that we revert, and open a feature branch to implement the smaller changes."

It’s this kind of structured decision-making that makes Codex feel like more than just a code generator—it’s a tool that reasons with the project in mind.

In Practice

Fixed Docker mount bugs in under a minute
Deployed hybrid Next.js + Edge Function apps with cleaner env config
Wrote simple Bash helpers faster than switching to a browser tab
Tested code changes with model commentary I could inspect and trust

Limitations Worth Noting

Codex isn't a full agent, and it has some clear boundaries:

It doesn't do live web searches, which can be limiting when troubleshooting unknown errors or dependency conflicts.
It relies on local context only—you get what you give it.

The biggest issue with gpt-4.1 right now is instability around file editing. In many sessions, Codex will report successful patching, but changes don't actually get written to disk. This inconsistency can break workflows without clear feedback, making gpt-4.1 frustrating for multi-step code refactors.
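Until that is fixed, a cheap guard is to check the disk yourself. Here is a minimal sketch, assuming you know which files the edit was supposed to touch; file_digest, snapshot, and unchanged_since are hypothetical helpers of my own, not part of Codex CLI.

```python
import hashlib
from pathlib import Path
from typing import Optional


def file_digest(path: Path) -> Optional[str]:
    """Return the SHA-256 of a file's bytes, or None if the file is missing."""
    if not path.exists():
        return None
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(paths):
    """Record a digest for each path before asking the tool to edit."""
    return {Path(p): file_digest(Path(p)) for p in paths}


def unchanged_since(before):
    """List the files whose content is still identical to the snapshot."""
    return [p for p, digest in before.items() if file_digest(p) == digest]
```

Take a snapshot before the edit request; if the tool reports success and unchanged_since still lists a target file, the patch never landed and you can stop before the next step builds on it.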

There is also a known Node.js warning that can occur during heavy usage:

(node:20662) MaxListenersExceededWarning: Possible EventTarget memory leak detected. 11 abort listeners added to [AbortSignal]. MaxListeners is 10. Use events.setMaxListeners() to increase limit

While this isn’t the cause of the main crash behavior, it's a symptom of process saturation that may require restarting the CLI.

My current takeaway: while the gpt-4.1 integration is promising—and still useful in situations that require deeper contextual understanding or multi-file reasoning—it's also buggy in key areas, especially around file edits and process stability. In contrast, I was pleasantly surprised by how well gpt-4-o-mini performed. It took a bit longer to work through complex issues, but it was persistent, accurate, and refreshingly reliable.

It’s also worth saying that Claude still has legs, particularly when viewed through the lens of Codex’s limitations, most notably its lack of web search. While Codex outperforms it in structured coding workflows, Claude’s ability to browse the web makes it a better general-purpose assistant. For problems that required research, external docs, or integrating with web scrapers, Claude’s open context and tool access were invaluable. Claude also feels more polished in terms of UI and integrated tooling, particularly for users working outside the CLI or across multiple workflows. It’s still part of my toolbox, just not always my first stop.

Should You Use It?

Yes.

Codex CLI isn’t a gimmick. It’s not trying to be your copilot. I first encountered it while preparing a broader State of the Union on agentic coding—and it fits squarely into that evolving category: tools that can reason, act incrementally, and stay out of your way. It’s an AI-native command-line assistant that makes you faster without making a fuss:

It gets the model selection right
It respects your environment
It explains itself just enough
It lets you work without waiting

And in the broader context of agentic coding—which is about to get a major spotlight—it’s a solid glimpse into what AI-native tools should look like: quiet, capable, and engineered with intent. Add on the affordability of 4o-mini, and this may end up being my daily tool.

Given its newness and rawness, maybe hold off on putting it against your toughest problems…but at least have it take a look at them.

Footnotes

Other "Codex" projects include:

OpenAI Codex (2021): the model behind GitHub Copilot, now deprecated.
Community Codex CLI: a local-first, offline-friendly command-line tool for coding assistance.
This article is focused on the OpenAI Codex CLI, released April 2025, available on GitHub.