TL;DR
The REPL model (type prompt, watch agent work, repeat) is already obsolete for developers who’ve mastered specification-driven prompting
Human-on-the-Loop (HOTL) replaces Human-in-the-Loop (HITL) as the dominant paradigm: strategic direction instead of tactical supervision
With TxDD (Ticket-Driven Development), I’m seeing 90%+ task completion on scoped work in my own testing
Tools like Ralph, claude-flow, and Claude Squad enable multi-agent orchestration that runs for extended periods with minimal oversight
By mid-2026, “fire and check in” will likely become the default for serious AI-assisted development
I’ve been running claude-mpm for months now, enforcing strict delegation to agent teams. And I’ve noticed something that’s reshaped how I think about AI development: I don’t nanny code anymore.
Here’s what that shift looked like. Six months ago, I’d prompt Claude to refactor a module, then watch it work. It would start down a path I didn’t like. I’d interrupt. Redirect. It would drift again. More interruption. The task might take two hours, but I’d be actively engaged for most of that time—half writing, half supervising.
Now? Last week I spun up an agent team to restructure the authentication flow in a client project. Wrote a detailed spec with acceptance criteria, test requirements, and constraints. Fired it off. Checked Slack, made coffee, reviewed some PRs on another project. Came back 45 minutes later to a working implementation that passed all the tests I’d specified. The agents had made different architectural choices than I would have, but the code worked and met the requirements. That’s the shift.
The cognitive move from “supervise every response” to “trust the framework, check the output” took longer than I expected. But once you’re on the other side of it, the old REPL model feels like driving with your hands on someone else’s wheel.
This is Part 2 of my CLI series. Part 1 explored why command-line interfaces beat IDEs for agentic work. Now I want to dig into what comes next: communication models for AI engineering teams that run autonomously for minutes or hours, needing only strategic direction from humans.
HITL was never the destination
Human-in-the-Loop made sense when models were flaky. You’d prompt, watch, correct, repeat. Every response needed validation. The REPL pattern emerged from genuine necessity.
But we’ve been treating HITL like an end state instead of a stepping stone. The pattern assumes human attention is cheap and AI judgment is expensive. That ratio has flipped. My attention is now the bottleneck. The agent teams aren’t perfect, but they’re good enough that constant supervision costs more than occasional correction.
The alternative is Human-on-the-Loop (HOTL). You set objectives, define constraints, establish checkpoints. Then step back. The system runs autonomously with structured telemetry. You intervene when escalation triggers fire, not when you feel like checking in.
HOTL emphasizes what researchers call minimal trust surface: limit what agents can access, track commands and file edits and external calls, pause on unexpected conditions, run agent outputs through validation like CI/CD for human code. It’s not “set and forget.” It’s “set and verify.”
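To make "set and verify" concrete, here's a minimal sketch of a validation gate you could run over agent output before anything merges. The thresholds, protected paths, and tool choices (pytest, ruff) are illustrative assumptions on my part, not part of any particular framework:

```python
import subprocess

# Illustrative escalation triggers; tune these per project.
MAX_CHANGED_FILES = 25
PROTECTED_PATHS = ("auth/", "billing/", ".github/workflows/")

def changed_files() -> list[str]:
    out = subprocess.run(
        ["git", "diff", "--name-only", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]

def validate_agent_output() -> list[str]:
    """Run the same gates we'd apply to human code; return reasons to escalate."""
    reasons = []
    files = changed_files()
    if len(files) > MAX_CHANGED_FILES:
        reasons.append(f"diff touches {len(files)} files")
    if any(f.startswith(PROTECTED_PATHS) for f in files):
        reasons.append("diff touches a protected path")
    if subprocess.run(["pytest", "-q"]).returncode != 0:
        reasons.append("test suite failed")
    if subprocess.run(["ruff", "check", "."]).returncode != 0:
        reasons.append("lint failed")
    return reasons

if __name__ == "__main__":
    problems = validate_agent_output()
    if problems:
        print("ESCALATE:", "; ".join(problems))  # pause the agent, ping a human
    else:
        print("OK: send to review")
```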
The LangGraph implementation makes this concrete. Their interrupt() functions pause graph execution mid-process, wait for human input, then resume cleanly. Strategic checkpoint placement matters: graph-level interrupts at predefined nodes, node-level interrupts for dynamic requests, approval loops before costly operations. The framework even supports time travel, rewinding to earlier states for alternative trajectories.
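Here's a minimal sketch of that pattern, based on LangGraph's documented interrupt/resume API; exact imports and signatures may vary by version, and the node logic is placeholder:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.types import interrupt, Command
from langgraph.checkpoint.memory import MemorySaver

class State(TypedDict):
    plan: str
    approved: bool

def propose_plan(state: State) -> dict:
    # An agent step would generate this; hardcoded for the sketch.
    return {"plan": "Refactor auth module into three services"}

def approval_gate(state: State) -> dict:
    # Pauses graph execution here; it resumes when a human responds.
    decision = interrupt({"plan": state["plan"], "question": "Approve before costly work?"})
    return {"approved": bool(decision)}

builder = StateGraph(State)
builder.add_node("propose_plan", propose_plan)
builder.add_node("approval_gate", approval_gate)
builder.add_edge(START, "propose_plan")
builder.add_edge("propose_plan", "approval_gate")
builder.add_edge("approval_gate", END)
graph = builder.compile(checkpointer=MemorySaver())  # checkpointer required for interrupts

config = {"configurable": {"thread_id": "ticket-123"}}
graph.invoke({"plan": "", "approved": False}, config)  # runs until the interrupt fires
graph.invoke(Command(resume=True), config)             # human approves; graph resumes
```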
The 90% threshold and TxDD
Here’s my working theory on what makes autonomous operation actually work: Ticket-Driven Development. I call it TxDD. Want to see it in action? Look at some of my GitHub issues.
The old approach was vibe coding—throwing loose prompts at agents and iterating through failures. Works fine for small tasks. Falls apart when you want agents running for an hour with minimal oversight.
TxDD inverts this. Before any code gets written, you build out a structured ticket hierarchy (sketched in code after the list):
Parent ticket with the overall objective and acceptance criteria
Sub-tickets breaking work into discrete, testable chunks
Each sub-ticket specifies success criteria, constraints, and validation steps
Dependencies mapped so agents know what to complete first
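Here's roughly what that hierarchy might look like if you encoded it directly. The structure and field names are my own illustration, not a claude-mpm or GitHub schema:

```python
from dataclasses import dataclass, field

@dataclass
class Ticket:
    id: str
    objective: str
    acceptance_criteria: list[str]
    constraints: list[str] = field(default_factory=list)
    validation: list[str] = field(default_factory=list)   # commands the agent must run and pass
    depends_on: list[str] = field(default_factory=list)   # ticket ids to finish first
    subtickets: list["Ticket"] = field(default_factory=list)

auth_refactor = Ticket(
    id="AUTH-100",
    objective="Restructure the authentication flow",
    acceptance_criteria=["All existing auth tests pass", "No change to the public API"],
    subtickets=[
        Ticket(
            id="AUTH-101",
            objective="Extract token validation into its own module",
            acceptance_criteria=["New module has unit tests", "Old call sites updated"],
            constraints=["Do not touch the session store"],
            validation=["pytest tests/auth -q"],
        ),
        Ticket(
            id="AUTH-102",
            objective="Add refresh-token rotation",
            acceptance_criteria=["Rotation covered by an integration test"],
            depends_on=["AUTH-101"],
        ),
    ],
)
```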
With this level of specification, I’m tracking around 90% task completion on scoped work in my own projects. That’s based on reviewing agent output against the acceptance criteria in each ticket—did it complete the sub-tasks, meet the constraints, produce working code? Not everything hits that bar. Complex architectural decisions still need human judgment. Multi-day autonomous operation still drifts. But for defined features, bug fixes, refactoring, documentation? The agents deliver more often than not.
A detailed case study of Claude Code integration into a 350k+ LOC codebase showed 80%+ of code changes fully written by agents with an estimated 30-40% productivity increase. Their key practices: explicit plan review gates, feature-based directory structure for context management, custom subagents for code review, fresh context per subtask. The methodology matters more than the model.
What people are building right now
The ecosystem for autonomous agent operation has exploded. Four categories worth watching:
Iteration loop enforcers
The Ralph plugin (officially ralph-wiggum, named after the Simpsons character who persists despite setbacks) implements a stop-hook pattern. When Claude would normally finish and return control, Ralph re-feeds the same prompt. The agent sees its previous work in modified files and git history. Progress persists through filesystem state, not conversation context.
Start a loop with /ralph-loop "Build the authentication system" --completion-promise "DONE" --max-iterations 20. The agent keeps working until it explicitly outputs “DONE” or hits the iteration limit. A community fork adds intelligent rate limiting, circuit breakers with error detection, and tmux session integration. It also catches agents spinning in failure loops with heuristics like MAX_CONSECUTIVE_TEST_LOOPS=3.
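The core loop is simple enough to sketch. Here run_agent is a hypothetical stand-in for whatever invokes your agent; the real plugin wires this into Claude Code's stop hook rather than a Python driver:

```python
MAX_ITERATIONS = 20
COMPLETION_PROMISE = "DONE"
MAX_CONSECUTIVE_TEST_LOOPS = 3   # circuit breaker for agents stuck re-running tests

def ralph_style_loop(prompt: str, run_agent) -> None:
    consecutive_test_loops = 0
    for i in range(MAX_ITERATIONS):
        # Same prompt every iteration; progress lives in the working tree and
        # git history, not in the conversation context.
        output = run_agent(prompt)
        if COMPLETION_PROMISE in output:
            print(f"Completed after {i + 1} iterations")
            return
        if "running tests" in output.lower():
            consecutive_test_loops += 1
            if consecutive_test_loops >= MAX_CONSECUTIVE_TEST_LOOPS:
                print("Circuit breaker: agent appears stuck in a test loop")
                return
        else:
            consecutive_test_loops = 0
    print("Hit the iteration limit without a completion promise")
```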
Multi-agent orchestrators
Claude Squad provides a terminal TUI managing multiple agents (Claude Code, Aider, Codex CLI, Gemini) in separate workspaces with dual isolation through tmux sessions and git worktrees. YOLO mode auto-accepts all prompts for hands-off execution.
The AWS CLI Agent Orchestrator formalizes orchestration into four modes: Handoff (synchronous task transfer), Assign (asynchronous parallel execution), Send Message (direct communication with running agents), and Flow (scheduled runs via cron expressions).
claude-flow goes further with a hive-mind architecture featuring 64 specialized agents and a “queen” coordinator. Its Dynamic Agent Architecture supports self-organizing agents with fault tolerance. Stream-JSON chaining runs 40-60% faster than file-based agent communication.
Commercial autonomous platforms
Devin operates through Slack integration. Message it like a colleague. Devin 2.0 added an Agent-Native IDE with embedded code editor, terminal, sandboxed browser, and smart planning tools. Enterprise customers like Nubank reportedly see 12x efficiency improvements on ETL migrations—though the answer.ai team’s rigorous trial found only 3 out of 20 tasks succeeded (15%). The gap between vendor case studies and independent testing remains wide.
Factory AI embeds task-specific “Droids” across the development workflow. Users assign GitHub issues to Factory; it pulls context and creates PRs automatically. The platform supports running 1000+ agents in parallel for migrations and allows fine-tuning how independently agents operate.
Communication protocols
The AG-UI protocol has emerged as an open standard, emitting ~16 event types during agent executions. Status streaming operates at three layers: token-level (LLM generation as it happens), node-level (state transitions between workflow steps), and custom streaming (progress during long-running operations).
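A hedged sketch of what a consumer of that stream might look like, routing events by layer. The event type names here are illustrative of the three layers, not copied from the AG-UI spec:

```python
from typing import Iterable

# Illustrative layer mapping, not the literal AG-UI event catalogue.
TOKEN_EVENTS = {"TEXT_MESSAGE_CONTENT"}                               # token-level: LLM output as it streams
NODE_EVENTS = {"STEP_STARTED", "STEP_FINISHED", "RUN_STARTED", "RUN_FINISHED"}
CUSTOM_EVENTS = {"CUSTOM"}                                            # progress during long-running operations

def route_events(events: Iterable[dict]) -> None:
    for event in events:
        kind = event.get("type", "")
        if kind in TOKEN_EVENTS:
            print(event.get("delta", ""), end="", flush=True)   # live token stream
        elif kind in NODE_EVENTS:
            print(f"\n[{kind}] {event.get('name', '')}")        # workflow state transitions
        elif kind in CUSTOM_EVENTS:
            print(f"\n[progress] {event.get('value', '')}")     # long-operation heartbeats

# Example: a fake event stream standing in for a real agent run.
route_events([
    {"type": "RUN_STARTED", "name": "auth-refactor"},
    {"type": "TEXT_MESSAGE_CONTENT", "delta": "Extracting token validation..."},
    {"type": "CUSTOM", "value": "42/120 files scanned"},
    {"type": "RUN_FINISHED", "name": "auth-refactor"},
])
```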
The communication layer problem
Running agents autonomously creates a new problem: how do you know what they’re doing?
I’ve been experimenting with a pattern where I fire up CLI tools in a tmux session and have an LLM interpret and summarize results. The outer LLM acts as my translator. It reads agent output, compresses hours of activity into digestible summaries, flags decisions that need my input.
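A rough sketch of that translator pattern, assuming the agents run in tmux. The capture-pane call is standard tmux; call_llm is a placeholder for whatever model client you use:

```python
import subprocess

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your model client (Anthropic, OpenAI, local, etc.).
    raise NotImplementedError("wire this to your LLM client")

def capture_pane(session: str, lines: int = 2000) -> str:
    """Grab recent scrollback from the tmux pane where an agent is running."""
    out = subprocess.run(
        ["tmux", "capture-pane", "-p", "-t", session, "-S", f"-{lines}"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout

def summarize(raw_output: str) -> str:
    # The outer LLM acts as translator: compress raw agent output into a
    # reviewer-facing summary and flag anything needing human input.
    prompt = (
        "Summarize this agent session for a human reviewer. "
        "List key decisions, files changed, test results, and anything "
        "that needs my input:\n\n" + raw_output
    )
    return call_llm(prompt)

if __name__ == "__main__":
    print(summarize(capture_pane("agents:auth-refactor")))
```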
This is emerging elsewhere too. Factory AI’s research shows structured summarization retains more useful information than raw conversation logs. Jules from Google generates audio summaries of recent commits. Cline v3.15’s Task Timeline provides a visual storyboard where each block represents a key step from initial prompt through tool execution to file edits.
Dashboards are maturing through tools like Langfuse (open-source session tracking), Sentry AI Agents Insights (traces showing LLM calls and tool errors), and Mem0 (multi-level memory for user, session, and agent state).
The pattern I keep seeing: LLMs reading LLM output and telling humans what happened. That translation layer didn’t exist six months ago, and it’s becoming essential infrastructure. Without it, autonomous operation creates opacity—you don’t know why an agent made a decision until something breaks.
What works and what breaks
Let me be specific about where autonomous operation delivers and where it fails.
Works reliably:
Fixing merge conflicts, linter errors, and simple bugs with clear reproduction steps
Adding boilerplate code and writing tests for existing code
Refactoring within defined boundaries
Prototyping MVPs with clear specs
Documentation generation and updates
Still needs human oversight:
Complex architectural decisions
Enterprise-scale codebases (context limitations kill autonomy)
Multi-day autonomous operation (drift is real)
Anything requiring cross-system coordination
Security-sensitive changes
The Cerbos analysis calls this the “70% problem”: AI gets you most of the way, but the last 30% sometimes takes longer than doing it from scratch. I think that understates current capability for teams with good specification discipline. With TxDD, I’m seeing closer to 90% on well-scoped tasks—but “well-scoped” does a lot of work in that sentence.
Fire and check in becomes the norm
Here’s my prediction for 2026: “fire and check in” will likely replace REPL as the default interaction pattern for serious AI-assisted development.
Not “set and forget.” That’s too optimistic. But the ratio of human attention to agent execution time should collapse. Instead of watching agents work, you’ll (a code sketch follows the list):
Write detailed specs in structured tickets (TxDD discipline)
Fire up agent teams with clear objectives
Check in periodically for status summaries
Review completed work in batches
Intervene only when escalation triggers fire
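Sketched as code, with every callable standing in for your own tooling (claude-mpm, Claude Squad, a queue of tmux sessions); the check-in interval and statuses are illustrative:

```python
import time

def fire_and_check_in(tickets, launch, status, escalated, review_batch, intervene):
    """Hypothetical driver for the loop above; all arguments are stand-ins."""
    running = [launch(t) for t in tickets]            # fire agent teams with clear objectives
    while running:
        time.sleep(15 * 60)                           # periodic check-in, not constant supervision
        finished = [j for j in running if status(j) == "done"]
        tripped = [j for j in running if escalated(j)]
        for job in tripped:                           # intervene only on escalation triggers
            intervene(job)
        if finished:
            review_batch(finished)                    # review completed work in batches
        running = [j for j in running if j not in finished and j not in tripped]
```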
Cursor 2.0’s Background Agents are already pushing this direction, running up to 8 agents simultaneously with OS notifications when agents complete or need input. Replit Agent 3 offers “Max Autonomy Mode” running up to 200 minutes continuously with minimal supervision.
The tooling exists. The communication layers are maturing. The missing piece is developer discipline: the willingness to invest upfront in specification quality so agents can run autonomously.
That’s the real skill transfer happening right now. We’re learning to manage AI engineering teams the way we’d manage human ones: clear objectives, defined constraints, trust but verify, intervene on exceptions rather than supervising every action.
What I’m building toward
My own workflow is converging on something like an engineering team I direct but don’t supervise. Claude-mpm handles the orchestration layer. TxDD specs define what “done” looks like. Agent teams run in tmux sessions. A summarization layer tells me what happened.
The part I’m still figuring out: communication abstraction. Right now I’m firing off specs and checking results manually. I want a pattern where I can queue tasks, get progress notifications, review outputs in batches, redirect when needed. Something closer to a project management interface than a chat window.
That’s what’s next after the REPL. Not faster autocomplete. Not smarter suggestions. A different relationship entirely: humans as strategic directors of autonomous AI engineering teams.
The agents are almost good enough. The frameworks exist. The remaining gap is process and tooling for that translation layer between human intent and agent execution. Whoever solves that owns 2026.
I’m Bob Matsuoka, writing about agentic coding and AI-powered development at HyperDev. For more on this topic, see Part 1: Why CLI Beats IDE for Agentic Work or my deep dive into claude-mpm orchestration patterns.