We’ve Turned A Corner

The anatomy of a 14-hour harness session — what a fully instrumented Claude Code orchestration run actually did, turn by turn, with a receipt for every move

Jun 17, 2026

I watched a Claude Code session run a multi-workstream engineering job last week, and I caught myself doing something I do not usually do with these tools: I stopped intervening. The orchestrator (the PM layer, running claude-opus-4-8) handled the kind of coordination work I would expect from a team lead who has been on the project a year: decomposing the job, predicting conflicts, routing to the right specialist. Not a faster coder. A technical lead.

This time the session was recorded turn by turn, with per-token cost telemetry on every move. So these are not impressions. Below are the specific behaviors I saw, described as precisely as the instrumentation allows, each one carrying its timestamp, the orchestrator’s own words, and what it cost. The conclusion I’ll leave to you.

The job ran on trusty-tools, a Rust workspace, on 2026-06-10. One human (me), one PM coordinator, a fleet of subagents, and a memory layer plus code search and an adversarial PR reviewer wired in over MCP. It ran for 14 hours and 37 minutes. I authored nine content prompts in that window. The longest was 84 characters.

TL;DR

A single Claude Code + claude-mpm session ran 14h 37m on a real Rust codebase and cost $485.96 at rack rate — $264.65 for the PM (claude-opus-4-8, 265 turns) and $221.30 for the subagents (sonnet at 4,626 turns, haiku at 605). It made 63 delegations. On a turn basis, 93% of the work was autonomous; my share was 7%.
I’m on Anthropic’s Max plan — $200/month flat. This session ran inside that subscription. The rack-rate figure is the right way to price the economic value of the work; the actual cost to me was zero marginal dollars.
That 7% was nine human-authored prompts in 14.5 hours. Several were a word or a phrase: proceed, let's do the top candidates. The longest ran 84 characters. The orchestrator supplied the structure; I supplied the direction changes.
The economics work because of cache. The sonnet agents read 3,531 cached tokens for every fresh input token (482M cache-read against 137K fresh). The PM ran at 587×, haiku at 34,868×. Most of the context window each turn is priced as cache, not fresh input.
My idle time shows up on the bill. The most expensive PM turns are cache-cold context rebuilds after I left a long gap: $7.26 to resume after proceed (1.5 hours of silence), $5.82 to pick back up after a 4.5-hour wait.
The orchestrator ran adversarial reviews on its own engineers’ work 13 times, unprompted — “Per my verification ownership I won’t take that at face value” — and bounced a starred approval back for more work. It absorbed two infrastructure failures without escalating them to me. None of these behaviors is impressive alone. A competent IC does them all before lunch. They appeared together, in one session, with a cost attached to each.

You can explore the full annotated session timeline — a turn-by-turn companion infographic showing the user-visible output alongside the underlying mechanism for each move.

A note on what this is and isn’t

I am not claiming the model understands anything, and I am not grading it on a benchmark. I am describing observed behavior from one session, the way you would describe an animal in the field: here is what it did, here is the order it did it in, here is what it cost, draw your own conclusions. Twelve behaviors stood out. They are below, roughly in the order they appeared.

The Stack in November

This session would not have run the same way seven months ago, not even close. Three layers changed in that window, and they changed together.

Start with the model. The session above ran on claude-opus-4-8, released May 28, 2026. Seven months earlier the comparable model was Opus 4.5, which shipped November 24, 2025. Better code is the obvious place to look for the shift, but the change that mattered here sits elsewhere: 4.8 is substantially less likely to let a flaw in its own work pass unremarked. It surfaces its own errors instead of quietly shipping them. That is the difference between a contractor who does their best and one who tells you when they think something is wrong. The 1M context window also moved from beta on 4.5 to generally available on 4.8, which matters across a 14-hour run. And inside Claude Code, 4.8 defaults to xhigh effort; the cost figures above reflect that setting.

Then the Claude Code harness around it. In November 2025, Claude Code was effectively a single-session tool; background agents existed, but worktree isolation did not. Worktree support is the critical unlock. Each subagent in this session ran in its own isolated git worktree: its own branch, its own working tree, no chance of stepping on a parallel agent’s files mid-run. The merge-conflict prediction in behavior #6 is only tractable because the agents writing code cannot collide on disk while they work. Parallel execution without merge confusion needed a harness feature that wasn’t there last November. Added in the same window: 5-level nested subagent hierarchies, claude agents for session-wide visibility, and Dynamic Workflows for the PM coordination layer.
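For readers who have not used worktrees, here is a minimal sketch of the isolation idea: one `git worktree add` per agent, each with its own branch and its own directory. The helper name, the `agent/` branch prefix, and the directory layout are my assumptions for illustration; this is not how Claude Code or claude-mpm wires it up internally.

```python
from pathlib import Path


def worktree_add_command(repo: Path, agent: str, base: str = "main") -> list[str]:
    """Build the git command that creates an isolated worktree for one agent.

    The branch prefix and directory layout are hypothetical conventions.
    """
    branch = f"agent/{agent}"
    worktree_dir = repo.parent / f"{repo.name}-worktrees" / agent
    return [
        "git", "-C", str(repo),
        "worktree", "add",
        "-b", branch,        # new branch owned by this agent
        str(worktree_dir),   # separate working tree on disk
        base,                # start point for the new branch
    ]


if __name__ == "__main__":
    # Two agents, two branches, two directories: nothing shared on disk.
    for name in ("engineer-a", "engineer-b"):
        print(" ".join(worktree_add_command(Path("/src/trusty-tools"), name)))
```

The function only builds the command so it can be inspected safely; passing the list to `subprocess.run` inside a real repository would create the worktree.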

The third layer is the orchestration framework. claude-mpm went from v4.26 to v6.5.44 over those seven months: two major versions, roughly 450 releases. Two changes carry most of the behavioral weight. trusty-memory became mandatory session context, so the PM reads project history before it does anything else, which is behavior #1 below. And worktree-first became the framework default, which is what enables behaviors #3, #6, and the parallel fan-outs in #9. The bench is deeper too: 57 specialized agents now versus about 30 in November, so the routing in behavior #9 has more specialists to route to.

None of these three improved in isolation. The model’s self-auditing is only useful if the harness can surface those audits inside a multi-agent pipeline. The worktree isolation is only useful if the orchestration framework knows to reach for it by default. The memory priming is only useful if the model is good enough to act on what it finds. When all three shift in the same seven-month window, the effect is not additive.

Strip away the version numbers and the feature lists, and the functional picture is narrower than it looks. The evolution since November clusters around three areas. Context: the 1M window is generally available now, and trusty-memory loads project history before the PM does anything. Recall: the PM enters each session knowing what happened before, with a deeper bench of specialists to route to. Error checking: the model flags its own flaws, and the orchestration layer runs adversarial review unprompted. None of the twelve behaviors below trace to the model writing better code. They trace to the system getting better at knowing what it knows, remembering what it has done, and catching what it gets wrong.

Session Vitals

Before the behaviors, the stat block. This is the anatomy laid flat.

Metric: Value
Total cost (rack rate): $485.96
PM cost (claude-opus-4-8, 265 turns): $264.65
Subagent cost (sonnet 4,626 turns + haiku 605 turns): $221.30
Delegations (agent calls): 63
Skill + MCP tool calls: 18
Wall-clock duration: 14h 37m
Human share (turn basis): 7%
Autonomous share (turn basis): 93%
Human-authored content prompts: 9
Longest human prompt: 84 characters

The cache numbers sit underneath all of it. The sonnet agents pulled 482,137,941 cache-read tokens against 136,569 fresh input tokens. The PM pulled 70,287,449 cache-read against 119,632 fresh. Those two figures are the reason a 14-hour run costs what a junior contractor costs for an afternoon. I come back to them below.
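The ratios can be recomputed from those four token counts. The sketch below is only arithmetic on the figures quoted above; the function names are mine, and the "cached share" ignores cache-write tokens because the post does not report them per model. Computed this way, the ratios come out at roughly 3,530 for sonnet and 588 for the PM, close to the ones quoted in the summary.

```python
# Token counts quoted in the text above.
SONNET = {"cache_read": 482_137_941, "fresh_input": 136_569}
PM = {"cache_read": 70_287_449, "fresh_input": 119_632}


def cache_ratio(usage: dict[str, int]) -> float:
    """Cached tokens read per fresh input token."""
    return usage["cache_read"] / usage["fresh_input"]


def cached_share(usage: dict[str, int]) -> float:
    """Fraction of input tokens (cache-read plus fresh) that came from cache."""
    total = usage["cache_read"] + usage["fresh_input"]
    return usage["cache_read"] / total


if __name__ == "__main__":
    for name, usage in (("sonnet", SONNET), ("pm", PM)):
        print(f"{name}: {cache_ratio(usage):,.0f}x, {cached_share(usage):.3%} from cache")
```

Put as a share rather than a ratio, well over 99% of the input tokens on both tiers were served from cache.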

1. Context First

The first action after my prompt was not a plan and not code. It was three memory and context calls in a row. The first one failed: memory_recall: missing 'palace' (no --palace default configured). The PM did not abort or report the error to me. It adjusted the call — added the palace parameter — retried, and got a stored finding from earlier work.

Then it read Cargo.toml and README.md before planning anything. In its own words at 21:48: “I’ll start by orienting myself. Let me query project memory and check what ‘core services’ and ‘console’ refer to, since neither maps obviously to a crate in this workspace.” One minute later it had resolved both terms and declared the scope — the equivalent of reading the ticket history before touching the keyboard.

2. Plan Before Acting

Rather than starting, it named the parallel streams and stated why they could run independently before dispatching anything: “These are independent, so I’ll dispatch them in parallel” (21:49). Structure stated before action taken.

The pattern held under pressure. When I said let's do the top candidates at 22:03, it refused to treat that as a green light: “‘The top candidates’ spans everything from a critical bug cluster to a 5,400-line refactor to design/ADR work — very different sizes and risk. Before I fan out agents (and tokens), let me confirm how aggressively to go.”

3. Agents, Narrated

Two agents went out in a single PM turn that cost $0.34 — one to ticketing to pull the open GH issues for the core services, one to local-ops to investigate and deploy the console. Both came back inside two minutes. Each got a defined, bounded job, and the PM said what each was for before sending it. Later waves followed the same shape: at 23:27 three agents launched in parallel for Wave 1, redeploying trusty-search, splitting a 5,421-line file, and drafting an architecture decision record.

4. Verify Before Committing

When I said let's do the top candidates, the PM stopped and asked rather than fanning out. It called AskUserQuestion and waited 19 minutes for my answer before dispatching a single engineer.

The same instinct showed up against its own team’s reports. After the rust-engineer reported PR #1097 done with tests, clippy, and fmt all green, the PM declined to take it: “Per my verification ownership I won’t take that at face value — especially since #1088 and #1089 were collapsed into one commit, and #1089’s core complaint ... may be only partially addressed.” It ran an adversarial PR review and a CI check in parallel before accepting the work. The verification was budgeted into the run.
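The shape of that gate is simple enough to sketch: run independent checks concurrently, accept only if all of them pass. Everything below is illustrative. The two check functions are hypothetical stand-ins with made-up verdicts, not the real reviewer agent or CI call, and the PM's actual mechanism is not something the telemetry exposes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Check = Callable[[int], bool]


def verify_before_accepting(pr_number: int, checks: dict[str, Check]) -> dict[str, bool]:
    """Run every check on the PR concurrently and return each verdict by name."""
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = {name: pool.submit(check, pr_number) for name, check in checks.items()}
        return {name: future.result() for name, future in futures.items()}


def accept(verdicts: dict[str, bool]) -> bool:
    """Accept only when every independent check passed."""
    return all(verdicts.values())


# Hypothetical stand-ins with invented verdicts, for illustration only.
def adversarial_review(pr_number: int) -> bool:
    return False  # pretend the reviewer found an issue only partially addressed


def ci_status(pr_number: int) -> bool:
    return True  # pretend CI is green


if __name__ == "__main__":
    verdicts = verify_before_accepting(
        1097, {"adversarial_review": adversarial_review, "ci": ci_status}
    )
    print(verdicts, "accepted" if accept(verdicts) else "bounced for more work")
```

The point of the design is that the engineer's own "all green" report is not one of the inputs; only independent checks count toward acceptance.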

5. Name What You Don’t Know

On the architecture decisions baked into ADR-0010, the PM drew the line cleanly: “it’s an architecture decision so I’ll draft it for your sign-off rather than implement blind.” It surfaced four open questions, each labeled blocking or non-blocking, and answered none of them unilaterally.

It applied the same judgment to a failure. When a memory write was blocked because the trusty-memory daemon held the palace write lock, the PM diagnosed the cause and made a call: “Memory write is blocked ... not worth a detour; the work is durably captured in the merged squash commit, closed issues, and commit messages.” It marked the gap, decided closing it wasn’t worth the interruption, and said so.

6. Conflict Prediction

Before any Wave 1 agent was dispatched, the PM named the collision: “#1096 and #607 both edit .line-cap-allowlist.tsv (splitting an allowlisted file requires removing/lowering its entry), so running them in parallel guarantees a merge conflict on that file.” It named the file, the two items that would collide, the mechanism, and the mitigation — sequential waves — in the same breath. The same pattern recurred in Wave 2, where it held #607 until #993 landed for the identical reason.
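The reasoning is mechanical once you know which files each item touches: any two items that share a file cannot run in the same wave. Here is a minimal sketch of that check with a greedy wave assignment. It is my reconstruction of the logic, not the PM's implementation, and only `.line-cap-allowlist.tsv` comes from the session; the other paths and the ADR item's file set are invented.

```python
from itertools import combinations


def find_conflicts(touches: dict[str, set[str]]) -> dict[tuple[str, str], set[str]]:
    """Map each pair of work items to the files both would edit."""
    conflicts = {}
    for a, b in combinations(sorted(touches), 2):
        shared = touches[a] & touches[b]
        if shared:
            conflicts[(a, b)] = shared
    return conflicts


def plan_waves(touches: dict[str, set[str]]) -> list[list[str]]:
    """Greedily place each item in the first wave whose files it does not overlap."""
    waves: list[list[str]] = []
    for item in sorted(touches):
        for wave in waves:
            if all(not (touches[item] & touches[other]) for other in wave):
                wave.append(item)
                break
        else:
            waves.append([item])  # overlaps every existing wave: start a new one
    return waves


# Only .line-cap-allowlist.tsv is from the session; the rest is hypothetical.
TOUCHES = {
    "#1096": {".line-cap-allowlist.tsv", "src/example_a.rs"},
    "#607": {".line-cap-allowlist.tsv", "src/example_b.rs"},
    "ADR-0010": {"docs/example_adr.md"},
}

if __name__ == "__main__":
    print(find_conflicts(TOUCHES))
    print(plan_waves(TOUCHES))
```

With this toy input the two items that share the allowlist file land in different waves, while the ADR draft rides along in the first one. Worktree isolation prevents collisions on disk during the run; a check like this is what prevents them at merge time.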

7. Unprompted Checkpoint

Unprompted, at 02:02, the PM produced a formatted two-section status checkpoint: a “Shipped to main” table listing each merged PR with its commit hash and a verification note, and an “In flight / staged” list with the state of each item still moving. Self-organized visibility, not a dashboard anyone designed for it.

8. Risk-Stratified Options

For the ADR-0010 decisions, the PM laid out four questions, each with named options and the cost of each. On the unknown-tag handling: Option P (permissive — risk: typo’d tags survive silently), Option L (allowlist — more control, requires per-index config), Option H (hybrid, the one it proposed). It recommended where it had a view and left the choice with me. At 03:14 it paused and asked what to work on next rather than self-selecting — which is where the 4.5-hour gap in the log comes from. It was waiting on me.

9. Specialist Routing

Routing stayed consistent by agent type. Issue reads and epic filing went to ticketing, daemon deploys to local-ops, implementation to rust-engineer. CI polling and merges were version-control’s; runtime QA went to api-qa, architecture feasibility to research. Across all 63 delegations, the PM did not hand an implementation task to ticketing or a CI task to the engineer.
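Written down as data, the routing I observed looks like a small lookup table. The agent names are the ones from the session; the task-kind keys and the `route` helper are labels I made up to express the mapping, and the real routing is done by the PM's judgment, not a dictionary.

```python
# Agent names come from the session; the task-kind keys are hypothetical labels.
ROUTES = {
    "issue_read": "ticketing",
    "epic_filing": "ticketing",
    "daemon_deploy": "local-ops",
    "implementation": "rust-engineer",
    "ci_polling": "version-control",
    "merge": "version-control",
    "runtime_qa": "api-qa",
    "architecture_feasibility": "research",
}


def route(task_kind: str) -> str:
    """Return the agent type for a task kind; refuse to guess for unknown kinds."""
    try:
        return ROUTES[task_kind]
    except KeyError:
        raise ValueError(f"no specialist registered for task kind {task_kind!r}") from None


if __name__ == "__main__":
    for kind in ("implementation", "ci_polling", "epic_filing"):
        print(f"{kind} -> {route(kind)}")
```

Raising on an unknown kind, rather than falling back to a default agent, mirrors what the session showed: no implementation task drifted to ticketing, and no CI task drifted to the engineer.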

10. Plans Sharpen with Information

When recon came back, the plan went from vague to specific. After the first two agents returned, the summary carried exact PR numbers, exact crate versions (trusty-search 0.24.4, trusty-memory 0.15.2, trusty-analyze 0.7.0), exact ports, and one specific stray process to clean up — none of which existed in my prompt.

After the issue specs arrived, “fix the top candidates” became a coordinated analysis: “these five cluster around one shared concern — the indexes.toml persistence / colocated / warm-boot scan paths — and #1088/#1089/#1090 in particular interlock through the config-write path, so a coordinated fix is safer than five isolated ones.” The prompt set a direction; the detail came from what the agents found.

11. Self-Maintained Cross-References

Throughout, the PM tracked issue numbers, file paths, commit hashes, and the relationships between them. It noticed that #1088 and #1089 had been collapsed into one commit and independently checked whether that collapse was safe — a cross-reference that was in no agent’s report, only in the PM’s own check against the original spec.

It also caught something I’d have missed: an external contributor, maui314159, had a live PR (#1082) covering a dependency that the #819 work needed. The PM surfaced it as a coordination point rather than duplicating or ignoring it. Later it drew the boundary precisely: “#819 isn’t blocked by conflict — it’s gated on accepting a contributor’s architecture decision, which is your call, not something I’ll auto-merge.” Then it reviewed the external PR in full before going further.

12. Clean Handoff

The session ended on a handoff, not on more output. After filing the final epic (#1119), the PM declared the scope complete, named the one item still in flight, and gave the resume mechanism: “Session stays paused (session-20260611-161949). That clears everything you asked for in this window. The only thing still in flight is #819 (KG ingest endpoint, building in kg-ingest) — I’ll relay its PR when it lands but won’t start anything new. Resume anytime with /mpm-session-resume.” I issued /exit 18 minutes later.

This is the strong version of stopping. The PM did not keep going to look busy, and it did not stop arbitrarily. It identified the boundary — what was done, what was still moving, where the human picks back up — and stopped there.

What the Instrumentation Adds

The twelve behaviors are what you’d see watching over the shoulder. The telemetry adds three things you can’t see that way.

The 7% is even smaller than it sounds. Nine human-authored prompts in 14.5 hours. Three were a word or a phrase — proceed, let's do the top candidates, in console running?. The longest content prompt I wrote all session was 84 characters: it doessnt show the individual consoles, theses should be tabs - and it should be am spa — typos and all. The orchestrator did not need a spec. It needed a direction and the occasional course correction.

The economics are a cache story. The sonnet agents read 3,531 cached tokens for every fresh input token. The PM ran at 587×; haiku, doing high-volume narrow tasks, hit 34,868×. Most of each turn’s context is pulled from cache at cache prices, not re-sent as fresh input. A 14-hour run that re-paid full freight for context on every turn would cost a multiple of $486. The caching is why an orchestration this long is economically viable at all. A note on what I actually paid: I’m on Anthropic’s Max plan at $200/month. The $485.96 above is rack rate — the right figure for understanding the economic value of the output. My marginal cost for this session was zero.

Waiting costs money, and you can see exactly where. The most expensive PM turns share one trait: cache_read=0 — the turns where context had to be rebuilt cold after a cache miss, each one corresponding to a long gap I left. The $7.26 turn, the single priciest of the session, came right after I typed proceed, following 1.5 hours of silence; it wrote 364,618 tokens of cold context to restart the pipeline. The $5.82 turn came after my 4.5-hour wait. The orchestrator picked up where it left off both times, with full context. It just had to pay to reconstruct it. In a system like this, my latency has a line item.
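One way to read the $7.26 figure is to divide it by the cold context it wrote. The sketch below does only that, using the two numbers quoted above; the function name is mine. Because the turn's cost also covers output and any other tokens, the result is an upper bound on the effective price of the cold context, not a published rate.

```python
def implied_rate_per_million(turn_cost_usd: float, tokens: int) -> float:
    """Dollars per million tokens if the whole turn cost is attributed to `tokens`."""
    return turn_cost_usd / tokens * 1_000_000


# Figures quoted above for the priciest turn of the session.
COLD_TURN_COST_USD = 7.26
COLD_CONTEXT_TOKENS = 364_618

if __name__ == "__main__":
    rate = implied_rate_per_million(COLD_TURN_COST_USD, COLD_CONTEXT_TOKENS)
    print(f"upper bound on cold-context price: ${rate:.2f} per million tokens")
```

That works out to a little under $20 per million tokens for a full rebuild, which is why a single long gap shows up as the most expensive turn of the run.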

And one finding the behaviors undersell: the failures. The first memory call errored and the PM fixed and retried it. The memory write later failed against a held lock, and the PM diagnosed the daemon contention and moved on. A subagent dropped its connection mid-run on an infra hiccup, and the PM recorded the socket error and later resumed the job to finish it. Three failures, three absorptions, zero escalations to me. That is the part that reads most like a year-one IC: not that nothing broke, but that what broke got handled below my line of sight.

So What?

Four threads run under all of this.

The first is where the line now sits between supervision and delegation. For most of these tools, the answer has been “supervise closely” — read every diff, catch every drift. This session moved the line. I supervised at the level of direction and architectural calls; I delegated everything from PR review to conflict avoidance to failure recovery. The PM did its own adversarial review 13 times and bounced a starred approval. Verification did not disappear; it moved inside the loop, and it left a receipt.

The second is what changes when the coordination work — not the typing — is the part the tool does well. The code these systems write stopped being the interesting question a while ago. The interesting development is that the orchestration layer now decomposes work, predicts conflicts, routes to specialists, tracks provenance, and knows when to hand back. That is project management. When the typing is cheap and the coordination is the hard part, a tool that coordinates well is worth more than a tool that types fast.

The third is the one underneath everything else. No single behavior above is impressive on its own. A competent IC reads the ticket history, names the parallel streams, predicts a merge conflict, routes to the right person, and stops at the decision boundary — without being told. What changed is that they now appear together, unprompted, in one session — and this time there’s a receipt for every one of them. The dollar figure is not the headline. The instrumentation is. We can finally watch the whole thing work and count what it cost, behavior by behavior.

The fourth is about where to point the question. For a while the useful question was “what can the model do.” After this session I think the better question is “what can the trifecta do,” because none of the behaviors above trace cleanly to a single component. They come out of the interaction: a model that audits its own work, a harness that isolates parallel work in separate worktrees, and an orchestration layer that routes to the right specialist with project memory already loaded. Pull any one of the three and the session degrades — the self-audit goes nowhere without a pipeline to surface it, the parallel waves collide without isolation, the routing misfires without memory. The capability lives in the seams between the parts, not in any one of them.

One last note on the economics. The Max plan is $200 a month. For that, I ran a session that bills at $486 rack rate, and the output is good enough that going back to a metered model is hard to imagine. Dealers give the first one away for a reason. The Max plan is that first bag: the work is compelling enough to be structurally addictive, and the flat rate strips out the per-token friction that might otherwise make you stop and think. I don’t mean that as a complaint. It’s a description of how the pricing works on the user.

Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.

Related reading:

Explore the annotated session timeline — the turn-by-turn companion infographic for this session
AI Power Ranking — Tool comparisons and benchmarks for AI practitioners
LinkedIn Newsletter — Strategic AI insights for CTOs and engineering leaders
