The Other Shoe Has Dropped

The Economics of Enterprise Inference Usage

May 29, 2026

Two stories from the last two weeks. Uber burned through its entire 2026 AI budget in four months on Claude Code, with COO Andrew Macdonald telling the Rapid Response podcast that the link between that spend and shipped consumer features “is not there yet.” The Information had the underlying numbers a few weeks earlier: engineer adoption from 32% to 84% between December and March, heavy users running $500–$2,000/month in tokens, and CTO Praveen Neppalli Naga torching $1,200 in a two-hour demo. Same week, Microsoft told thousands of engineers in its Experiences + Devices division that their Claude Code access is going away. Windows Central, summarizing The Verge’s Notepad scoop, has the cutoff at June 30 — end of fiscal year — with cost as the actual driver even though EVP Rajesh Jha framed it publicly as convergence on Copilot CLI.

Two of the most AI-forward enterprises on the planet, same tool, same week. The “AI is failing” takes were live within hours.

I don’t buy that framing.

The headlines are getting it wrong. Uber didn’t cancel anything — adoption ran ahead of the budget and the company blew its annual spend keeping up. That’s a planning failure, not a verdict on the tool. Microsoft didn’t divorce Anthropic either; they’re still consuming Claude through Azure Foundry and M365 Copilot. What they cancelled is a specific license — Claude Code at the engineer-seat level — because engineers preferred it over GitHub Copilot CLI and the division was paying for that preference.

What both stories show: AI is a new tool and we haven’t learned to use it well yet. The teams over budget pointed it at problems it wasn’t the cheapest way to solve, then let it decide for itself how much work to do per task.

I’ve made the cloud parallel here before. Early cloud was expensive and misused. Lift-and-shift workloads routinely ran two or three times their on-prem cost — I watched that play out across teams I ran, and it took years to correct through architecture. Then the industry learned: right-sizing, reserved instances, autoscaling, serverless where it fit, on-prem where it didn’t. The bills came down. Not because compute got dramatically cheaper, but because we got more careful about what we asked the cloud to do. AI is in the same phase. Cheap per-token, expensive per-task, and the gap is architectural.

A few weeks ago I ran controlled head-to-head tests on Opus 4.6 and Opus 4.7 against identical coding tasks. Both models passed every test. Opus 4.7 cost 3.6× more to do it. Same outcomes, same rate card, dramatically more tokens.

Finout’s analysis of production deployments tells the same story at scale: up to a 35% cost increase overnight, driven by tokenizer changes that don’t show up on the per-token rate card. Not one team’s bad luck — the shape of the bill across the enterprise AI buyer base right now. The second of two shoes on AI economics.

I wrote about that version-to-version cost drift in detail. Providers can collapse per-token prices in public while the per-task bill drifts upward in private. The first shoe was the per-token price collapse that made everyone optimistic. The second is the behavioral and architectural cost overhang now landing on quarterly P&Ls.

TL;DR

Per-token costs at GPT-3.5-equivalent performance are down roughly 280× since late 2022, per Stanford’s AI Index 2025. Vendor revenue tells the opposite story: Anthropic’s annualized revenue went from $1B in January 2025 to $30B by April 2026 — a 30× move in 15 months, coming from enterprise inference, not consumer subscriptions.
Gartner’s April 2026 survey: just 28% of AI use cases fully meet ROI expectations, and 78% of IT leaders report material AI charges that didn’t show up in any procurement model.
Gartner estimates agentic workflows consume 5–30× more tokens than equivalent chatbot interactions; Stanford’s Digital Economy Lab puts the upper bound for coding agents at 1,000×. The cost driver isn’t the model — it’s the workflow architecture wrapped around it.
Two patterns hold the line in production. Search-first architectures put inference at the end of a deterministic pipeline. Consolidated single-shot designs replace multi-call chains.
Inference is a power tool, not a default. Use it with specific ROI goals per call, use it to write the code that solves a problem rather than to solve the problem directly on every request, and bound it with deterministic structure on both ends.

Why the Per-Token Savings Didn’t Reach the Invoice

Per-token economics of frontier models have been collapsing for two years. Stanford’s AI Index 2025 puts the decline at roughly 280× from a late-2022 baseline at GPT-3.5-equivalent performance. Most enterprise budget conversations in 2024 started from that headline. The implicit assumption: bills should be going down.

They’re not. The clearest read comes from the vendor side. Anthropic’s annualized revenue went from $1B in January 2025 to $30B by April 2026, with roughly 80% from enterprise and API usage rather than consumer subscriptions. Anthropic now discloses 1,000+ customers spending more than $1M per year — a cohort that doubled in under two months — alongside roughly 300,000 business customers. The mid-tier ($100K–$1M/year) grew 7× year over year.

Menlo Ventures’ 2025 State of Generative AI report cross-checks at the market level: enterprise GenAI spend tripled to $37B in 2025, with LLM API consumption alone at $8.4B by mid-year. The high tier shows up in named deals: Snowflake’s $200M multi-year partnership implies a $50–70M annual run rate from one customer, and Deloitte is deploying Claude across 470,000 employees.

For a typical enterprise running multiple production AI workloads, $500K–$2M per year is now the realistic floor. Fortune 100 is running $10M–$50M+, the most AI-intensive past $100M. The Gartner numbers point the same direction: just 28% of AI use cases fully meet ROI expectations and 20% fail outright, and 78% of IT leaders report material AI charges that didn’t show up in any procurement model.

The per-token chart is real. The invoice is also real. What explains the gap is behavior. Three behaviors specifically.

Models do more work per task. Reasoning models reason. Agentic loops loop. The prompt that used to consume 4K tokens now consumes 40K because the assistant explores, plans, second-guesses, and verifies. Some of that is valuable. Much of it is the model performing thoroughness in a way that costs you money. The Opus 4.6-to-4.7 jump I documented earlier: same task, same outcome, 2.9× more output tokens and 4.8× more cache reads.

Workflows fan out. A “single” task in a modern agentic system might trigger a planner, researcher, coder, reviewer, and summarizer. Each makes its own LLM calls over overlapping context. Gartner’s March 2026 analysis puts agentic workflows at 5–30× the token consumption of an equivalent chatbot ask. Stanford Digital Economy Lab’s April 2026 arXiv paper goes further: coding agents can consume 1,000× more tokens than equivalent chat completions. The agent isn’t more expensive because it’s smarter. It’s more expensive because it’s louder.

Context windows fill themselves. Long context is a feature in marketing and a bill in practice. In our own enterprise Claude.AI usage — 82,852 messages from 329 employees over 3.5 months, audited via the Anthropic Compliance API — the average request carried 366,000 input tokens, mostly from 10-turn conversations dragging accumulated history forward into every new turn. Most production systems I’ve audited show the same fingerprint: pipelines paying for context they aren’t actually using.
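A rough sketch of why dragging history forward gets expensive. It assumes, purely for illustration, that every turn adds the same number of new tokens and that the full history is re-sent on each turn. The function name and the 5,000-token figure are hypothetical, not numbers from the audit.

```python
def billed_input_tokens(turns: int, new_tokens_per_turn: int) -> list[int]:
    """Input tokens billed on each turn when the full history is re-sent.

    Simplifying assumption: every turn adds the same number of new tokens.
    Real conversations are lumpier.
    """
    return [turn * new_tokens_per_turn for turn in range(1, turns + 1)]


per_turn = billed_input_tokens(turns=10, new_tokens_per_turn=5_000)
total_billed = sum(per_turn)    # what the provider meters across the conversation
total_new = 10 * 5_000          # what the user actually contributed

print(per_turn[-1])               # input size of the tenth turn
print(total_billed / total_new)   # overhead factor from re-sending history
```

Under these assumptions a ten-turn conversation bills 5.5 times the tokens the user actually added, and the factor grows as (n + 1) / 2 for n turns. Prompt caching changes the price of the repeated portion, but not the shape of the curve.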

None of this is fraud and none of it is mysterious. It’s the natural consequence of letting probabilistic systems decide how much work to do on every call. The savings from cheaper tokens were real. They just got consumed by an order of magnitude more tokens per task.

What I’ve Found Shipping These Systems

The teams handling this well aren’t the ones cutting AI usage. They’re changing the shape of how they use it.

The pattern I keep coming back to: treat inference as the expensive step at the end of a mostly deterministic pipeline. Do the cheap, structured work in code. Reserve the model call for the part that actually requires judgment. Then bound the call hard — context budget, output budget, quality gate on whether the call even runs.

Two examples from systems I’ve been building illustrate this from different angles.

Example 1: Code-Intelligence — Search-First Architecture

The naive version of a code review tool is obvious: dump the changed files into Claude, ask for a review. That works. It also costs roughly $0.03 per operation, scales linearly with repo size, and produces a lot of review output you didn’t need. Claude.AI offers a code review service — it ended up costing us thousands a month for just a few repos. Augment Code offers a well-regarded one as a GitHub app, but charges a platform fee (a meaningful fraction of our Anthropic spend) just to connect.

So we built our own. It leverages a multimodal search/RAG/knowledge-graph engine I’d already built, so this wasn’t from scratch.

The version I actually ship uses a multi-tier search pipeline with the LLM call at the very end:

Stage 1: Vector Search    (~$0.0002 per query, semantic similarity)
Stage 2: BM25 Reranking   (~$0.0001 per query, lexical relevance)
Stage 3: Static Analysis  (~$0.0001 per query, AST + symbol resolution)
Stage 4: Quality Gate     (free, deterministic threshold check)
Stage 5: Single LLM Call  (~$0.03 per call, only if Stages 1-4 passed)

The first four stages cost about $0.0004 combined. They do the bulk of the work: deciding what code is actually relevant, ranking it, pulling structural relationships, and deciding whether the result is even worth asking an LLM about.

Hard budget controls run through the whole pipeline:

# Budget enforcement, not aspiration
MAX_CONTEXT_FILES = 6          # cap on what we send to the model
MAX_REVIEW_WORDS = 500         # cap on what the model returns
RELEVANCE_FLOOR = 0.005        # quality gate before calling the model
TOKENS_PER_WORD = 1.4          # rough words-to-tokens conversion

def review_change(ranked_results, combined_relevance_score):
    if combined_relevance_score < RELEVANCE_FLOOR:
        # no point spending $0.03 to get a review of weakly-related code
        return SkipReason("below relevance floor")

    context = select_top_n(ranked_results, MAX_CONTEXT_FILES)
    # token limits must be integers, so round the word budget down
    max_tokens = int(MAX_REVIEW_WORDS * TOKENS_PER_WORD)
    return llm.review(context, max_output_tokens=max_tokens)

The RELEVANCE_FLOOR check is the part to underline. A meaningful percentage of review requests in real codebases don’t justify an LLM call at all — the changes are mechanical, the related code trivial, or the search signal weak enough that whatever the model says will be hallucinated context. Refusing to spend $0.03 on those cases is where most of the savings come from.

Rough economics across a quarter of usage:

Approach                    Cost per operation    LLM calls per 1K operations
Direct “review the diff”    ~$0.030               1,000
Search-first with gates     ~$0.0034 average      ~430

About 80% of the workflow logic ends up deterministic: search, ranking, static analysis, gating. The model handles the last 20% — judgment on curated context. Same outcome from the user’s perspective, roughly an order of magnitude cheaper, with more predictable failure modes because most of the pipeline is debuggable code rather than prompt behavior.
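The shape of the gated-pipeline economics can be sketched as a one-line expected-cost formula. This is a simplified model, not the accounting behind the table: it assumes every call that clears the gate costs the same flat amount, and the pass rates in the loop are illustrative rather than measured. The function name is hypothetical.

```python
def blended_cost_per_operation(
    deterministic_cost: float,
    llm_cost_per_call: float,
    gate_pass_rate: float,
) -> float:
    """Expected cost of one operation in a gated pipeline.

    Every operation pays for the deterministic stages; only the fraction
    that clears the quality gate also pays for the model call.
    """
    if not 0.0 <= gate_pass_rate <= 1.0:
        raise ValueError("gate_pass_rate must be between 0 and 1")
    return deterministic_cost + gate_pass_rate * llm_cost_per_call


DETERMINISTIC_COST = 0.0004   # Stages 1-4 combined, from the stage list
LLM_COST_PER_CALL = 0.03      # Stage 5, from the stage list

for pass_rate in (1.0, 0.5, 0.25):   # illustrative pass rates, not measurements
    cost = blended_cost_per_operation(DETERMINISTIC_COST, LLM_COST_PER_CALL, pass_rate)
    print(f"pass rate {pass_rate:.0%}: ${cost:.4f} per operation")
```

Because the flat-cost assumption ignores how much context each gated call actually carries, measured averages will not match this formula exactly. What it does show is that the gate pass rate, not the per-token price, is the dominant term.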

The limitation: this is more work than wiring up a single LLM call, and the gates are only as good as your search infrastructure. The payoff is on the cost and determinism side, not on speed of initial implementation.

Example 2: duetto-intelligence — Context Injection Instead of Replacement

The second pattern comes from duetto-intelligence, internal tooling I’ve been building against that same enterprise Claude.AI usage — 82,852 real employee messages over 3.5 months, not a thought experiment. The problem here isn’t “should we call the LLM at all.” It’s: given that our people are already routing structured-data questions through a $0.274-per-request multi-turn Sonnet conversation, what’s a cheaper path that doesn’t degrade the answer?

The audit data made the gap concrete. Average request: 366K input tokens, ten-turn conversation, $0.274 to Anthropic. The same query answered through a Haiku single-pass against deterministically-retrieved internal data: $0.0009. A 300:1 cost ratio on the slice of traffic about structured product knowledge, CRM/account prep, JIRA, people and org lookups.

Not all traffic. Somewhere in the 35–40% range, based on classified samples. About half of the remaining queries genuinely need full Sonnet or Opus reasoning (writing, debugging, free-form analysis) and shouldn’t be intercepted at all.

The framing matters, because it’s easy to misread this as “replace Claude with a smaller model.” It isn’t. duetto-intelligence (DI) acts as a context injection layer in front of the user-facing model. When a query has structured-data intent, we route a sub-query to DI, get back a bounded structured result, and inject that into the prompt the larger model sees. The expensive model still does the reasoning; it just stops being responsible for deterministic data retrieval, which it does poorly and at high cost.

The naive design for the routing layer looks like this:

1. Classify the user's intent             → LLM call (~150 tokens)
2. Plan which subsystems to query         → LLM call (~250 tokens)
3. Call subsystem A, summarize response   → LLM call (~300 tokens)
4. Call subsystem B, summarize response   → LLM call (~300 tokens)
5. Synthesize a final answer              → LLM call (~250 tokens)
                                          Total: ~1,250 tokens, 5 calls

Each call is plausible on its own. Together they’re a tax on every user interaction. Latency stacks linearly with calls, and any one of the five can hallucinate in a way that corrupts the rest of the chain.

The consolidated design replaces three steps with deterministic code:

1. Classify intent                  → LLM call  (~80 tokens, tight tagger prompt)
2. Fan-out to subsystems            → code      (0 tokens, intent → call map)
3. Consolidated synthesis           → LLM call  (~200 tokens, structured input)
                                       Total: ~280 tokens, 2 calls

The trick is the intent classifier. It produces a tag from a fixed vocabulary of 87 tags — revenue.query.ytd, forecast.compare.year_over_year, account.lookup.contact, and so on. Each tag maps deterministically to a set of downstream calls in plain Python. No LLM in the routing step. The model isn’t asked “what should we do?” It’s asked “what is the user asking about?” — a much smaller, bounded question.
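A minimal sketch of what a tag-to-calls map can look like in plain Python. The three tags are the ones named above; the fetcher functions, the `ROUTES` table, and the choice of which subsystems each tag touches are hypothetical stand-ins, not the actual duetto-intelligence code.

```python
from typing import Callable

# Hypothetical fetchers standing in for real subsystem clients.
def fetch_revenue_ytd(query: str) -> dict:
    return {"source": "revenue", "query": query}

def fetch_forecast_yoy(query: str) -> dict:
    return {"source": "forecast", "query": query}

def fetch_account_contact(query: str) -> dict:
    return {"source": "crm", "query": query}

# Tag -> ordered list of downstream calls. Only three example tags are shown,
# and the mapping itself is an assumption made for illustration.
ROUTES: dict[str, list[Callable[[str], dict]]] = {
    "revenue.query.ytd": [fetch_revenue_ytd],
    "forecast.compare.year_over_year": [fetch_forecast_yoy, fetch_revenue_ytd],
    "account.lookup.contact": [fetch_account_contact],
}

def route(intent_tag: str, query: str) -> list[dict]:
    """Fan out to subsystems with no model call; unknown tags fail loudly."""
    try:
        handlers = ROUTES[intent_tag]
    except KeyError:
        raise ValueError(f"unknown intent tag: {intent_tag!r}") from None
    return [handler(query) for handler in handlers]

results = route("forecast.compare.year_over_year", "Q3 forecast vs last year")
print([r["source"] for r in results])
```

Raising on an unknown tag, rather than falling back to a model call, is one way to keep the routing step honest: a miss shows up as an error to fix in the taxonomy instead of as silent extra spend.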

We validated against a 50-query test corpus drawn directly from the compliance data — real questions people had asked the model in production. After tuning, 100% of those queries land on the fast path with no LLM call required for routing. That’s the proof the deterministic-discipline part holds at the boundary; the routing isn’t quietly falling back to a second model call to bail itself out.

Budget enforcement is explicit in every prompt template:

SOURCE_CHAR_BUDGET = 600    # per data source pulled into context
OUTPUT_TOKEN_BUDGET = 200   # cap on synthesis response
INTENT_TAG_VOCAB = load_intent_taxonomy()  # 87 tags, versioned

def synthesize(intent_tag: str, sources: list[Source]) -> str:
    trimmed = [s.truncate(SOURCE_CHAR_BUDGET) for s in sources]
    return llm.complete(
        prompt=template(intent_tag, trimmed),
        max_tokens=OUTPUT_TOKEN_BUDGET,
    )

The economics, against measured baselines rather than estimates: current Claude.AI Chat spend across the 329-user population runs $15,264 over 3.5 months. Roughly $4,360/month, driven by that $0.274-per-request multi-turn average. If DI intercepts the 35–40% of traffic that’s structured-retrieval underneath, projected savings come in around $1,500–1,700/month, or $18–20K/year on this single user population. The leverage isn’t from picking a cheaper model. It’s from refusing to pay Sonnet rates to answer questions a deterministic system already has the data for.
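The projection can be reproduced with a few lines of arithmetic. The sketch below uses the figures quoted above and makes one explicit assumption that the post does not state: that the intercepted slice accounts for the same share of spend as it does of traffic. The function name is hypothetical.

```python
TOTAL_SPEND = 15_264.0        # USD over the audit window
WINDOW_MONTHS = 3.5
COST_MULTI_TURN = 0.274       # USD per request, multi-turn Sonnet average
COST_SINGLE_PASS = 0.0009     # USD per request, Haiku single pass

def projected_monthly_savings(intercept_share: float) -> float:
    """Monthly savings if `intercept_share` of spend moves to the cheap path.

    Assumption: intercepted traffic carries the same share of spend as of
    request volume.
    """
    monthly_spend = TOTAL_SPEND / WINDOW_MONTHS
    saved_fraction = 1 - COST_SINGLE_PASS / COST_MULTI_TURN
    return monthly_spend * intercept_share * saved_fraction

for share in (0.35, 0.40):
    monthly = projected_monthly_savings(share)
    print(f"{share:.0%} intercepted: ${monthly:,.0f}/month, ${monthly * 12:,.0f}/year")
```

Under that assumption the result is roughly $1,520 to $1,740 per month, in line with the range quoted above. The cheap path is so much cheaper that the savings are almost exactly the intercepted share of spend.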

The intent vocabulary is the contract. New capability means a new tag, a new downstream mapping, a new prompt template. The model never has to invent structure on the fly. This is what people mean by “use the LLM to code solutions, not to solve problems directly” — the routing logic lives in code, the tagger is a thin call, the synthesis is bounded.

The source-character budget matters more than the output budget. The compliance audit confirmed it: production overspend is on the input side. 366K input tokens against a few hundred output. Models will happily consume whatever context you hand them. Trimming at the source — 600 characters per source, no exceptions — is how you keep per-call cost from drifting upward as the system gets more capable.

The limitation: this only works on the routable slice. The pattern isn’t “eliminate inference.” It’s “stop spending $0.274 to answer questions that have a structured answer at $0.0009.”

Why “Just Use a Cheaper Model” Doesn’t Save You

A reasonable objection: aren’t the cheaper, smaller models supposed to handle this? Why not route everything to Haiku or an open model and call it solved?

The pricing math seems to support it. The behavioral math doesn’t.

Two things go wrong when you swap a cheaper model into an unstructured workflow. First, cheaper models are usually less efficient per task: more turns to converge, more exploration, more hallucination, which means more retries and verification calls. A model 5× cheaper per token can run 1.5–2× more expensive per completed task if your workflow lets it spin.

And the workflow itself is where most of the cost lives. The 5–30× multiplier is structural, not model-specific: it exists regardless of which model you point at it. Switching from Sonnet to Haiku inside an unbounded agent loop changes the per-token cost. It doesn’t change the loop.
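The per-task arithmetic behind the cheaper-model claim is simple enough to write down. This is a sketch with a hypothetical function name; the token multipliers are illustrative inputs chosen to show where the break-even sits, not measurements.

```python
def relative_cost_per_task(price_ratio: float, token_multiplier: float) -> float:
    """Per-task cost of the cheaper model relative to the original model.

    price_ratio: cheaper model's per-token price divided by the original's
        (0.2 means 5x cheaper per token).
    token_multiplier: tokens the cheaper model burns per completed task,
        including retries and verification calls, relative to the original.
    """
    return price_ratio * token_multiplier


for multiplier in (5.0, 7.5, 10.0):   # illustrative token multipliers
    print(multiplier, relative_cost_per_task(0.2, multiplier))
```

A model that is 5× cheaper per token breaks even once it uses 5× the tokens per completed task, and lands in the 1.5–2× more expensive range at 7.5–10× the tokens. An unbounded loop with retries can reach those multipliers without anyone noticing.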

Model choice is a 2–5× lever. Architecture choice is closer to an order-of-magnitude lever in the systems I’ve shipped — consistently larger than what model swaps deliver. Most teams are over-tuning the model selection and under-tuning the structure around it.

The default assumption — including from vendors with strong incentives to sell you more tokens — is that the answer to AI cost is buying inference more cleverly. The actual answer is using inference less, more deliberately, with hard bounds on what each call is allowed to do.

How I Think About Inference Now

Inference is a power tool. Not a default. You don’t reach for it when a search query, a regex, or a switch statement would do. You reach for it when you need probabilistic judgment over unstructured input. Every call you don’t make is the cheapest call.

Use it to code solutions, not to solve problems. The highest-leverage use of LLMs in my workflow is generating the deterministic code that then handles the workflow without further LLM calls. A model that writes you a 50-line classifier is more valuable than a model that acts as the classifier on every request forever. The first costs tokens once. The second costs tokens every transaction for the life of the system.

Wrap every call in a budget. Context budget on the input, token budget on the output, quality gate on whether the call runs at all. Treat the LLM call as you’d treat a paid API with rate limits and SLA penalties. Because it is.
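One way to make the three bounds concrete is a small wrapper that every call site has to go through. This is a sketch under stated assumptions: `CallBudget`, `budgeted_call`, and the `(prompt, max_tokens) -> text` client signature are hypothetical, and `fake_model` stands in for a real provider SDK so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class CallBudget:
    max_input_chars: int      # context budget on the input
    max_output_tokens: int    # token budget on the output
    min_quality: float        # gate: skip the call below this score

def budgeted_call(
    call_model: Callable[[str, int], str],   # hypothetical client signature
    prompt: str,
    quality_score: float,
    budget: CallBudget,
) -> Optional[str]:
    """Run the model only if the gate passes, with input and output capped."""
    if quality_score < budget.min_quality:
        return None                               # the call that never happens
    trimmed = prompt[: budget.max_input_chars]    # hard cap on context
    return call_model(trimmed, budget.max_output_tokens)

# Stand-in model so the sketch runs without a provider SDK.
def fake_model(prompt: str, max_tokens: int) -> str:
    return f"reviewed {len(prompt)} chars, up to {max_tokens} tokens"

budget = CallBudget(max_input_chars=600, max_output_tokens=200, min_quality=0.005)
print(budgeted_call(fake_model, "x" * 5_000, quality_score=0.2, budget=budget))
print(budgeted_call(fake_model, "x" * 5_000, quality_score=0.001, budget=budget))
```

Truncating by characters is crude; a production version would more likely trim per source or by token count. The point of the wrapper is that no call can bypass the gate or the caps.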

Set specific ROI targets per call. “AI-assisted code review” is too coarse to optimize. “Reviewing files with relevance score > 0.005, capped at 6 files, returning 500 words” is something you can measure cost-per-outcome on. Even loose ROI math at the call level surfaces where you’re paying for theater.

Treat behavioral cost as the primary risk. Model rate cards will keep coming down. They are not your problem. Your problem is what your pipeline asks of the model and what the model decides to do once asked. That’s the line item that grew while the unit cost dropped 280×. That’s the shoe that just dropped.

What This Means If You’re Running an AI Budget

Three things to look at, in order of how much they’ll move the line item.

Audit the call graph, not the rate card. Pull a representative day of production traffic and trace the actual LLM calls per user task. Count them. Most teams find a handful of workflows producing the majority of cost, and most of those have 2–4 LLM calls that could be replaced by deterministic code. That’s the consolidated-design pattern from the duetto-intelligence example. 50–80% reductions are common when you actually look.
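A minimal sketch of that audit, assuming you can export one record per LLM call tagged with the user task and workflow that triggered it. The record shape and the sample values are hypothetical; real traces would come from your own logging or your provider's usage export.

```python
from collections import defaultdict

# Hypothetical trace records: one dict per LLM call.
traces = [
    {"task_id": "t1", "workflow": "code_review", "input_tokens": 40_000, "output_tokens": 700},
    {"task_id": "t1", "workflow": "code_review", "input_tokens": 42_000, "output_tokens": 300},
    {"task_id": "t2", "workflow": "code_review", "input_tokens": 38_000, "output_tokens": 650},
    {"task_id": "t3", "workflow": "account_lookup", "input_tokens": 1_200, "output_tokens": 150},
]

def summarize(records):
    """Per workflow: task count, LLM calls per task, and total tokens."""
    tasks = defaultdict(set)
    calls = defaultdict(int)
    tokens = defaultdict(int)
    for r in records:
        wf = r["workflow"]
        tasks[wf].add(r["task_id"])
        calls[wf] += 1
        tokens[wf] += r["input_tokens"] + r["output_tokens"]
    rows = [
        {
            "workflow": wf,
            "tasks": len(tasks[wf]),
            "calls_per_task": calls[wf] / len(tasks[wf]),
            "total_tokens": tokens[wf],
        }
        for wf in calls
    ]
    # Heaviest workflows first: these are the ones worth restructuring.
    return sorted(rows, key=lambda row: row["total_tokens"], reverse=True)

for row in summarize(traces):
    print(row)
```

Sorting by total tokens rather than by call count keeps the focus on spend: a workflow with few calls but huge context can outweigh a chatty one.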

Put quality gates in front of inference. For any workflow where the LLM call is expensive and the input quality is variable, add a deterministic check that decides whether the call is worth making. That’s the search-first pattern from the code-intelligence example. The savings come from the calls you don’t make, which never show up on the invoice.

Set hard context budgets and enforce them in code. Per-source character limits, per-call token caps, no “just in case” context stuffing. The output budget gets attention because it’s visible. The input budget is usually where the actual money goes.

None of this requires changing models, switching providers, or making bets on the next frontier release. It’s architectural work inside the pipeline you already have — work the per-token price chart has been letting people defer.

The teams that do it over the next two quarters will look like they got a 5–10× cost improvement from “AI getting cheaper.” The teams that don’t will look like AI got 3–4× more expensive while everyone else’s costs fell. Same providers, same models, same rate cards. Different shoe.

Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.

Related reading:

Opus 4.6 vs 4.7: The Real Cost of Incremental AI Improvements — The first shoe, on per-task cost drift between model versions
AI Power Ranking — Tool comparisons and benchmarks for AI practitioners
LinkedIn Newsletter — Strategic AI insights for CTOs and engineering leaders

Daniel in the world

May 29

Really superb, Bob. Aligns with what we're seeing and working towards but as usual you are seeing ahead, making the harder insights easier to grasp.

I do think it also points to the increasing benefits of some strong open weight model capacity. The newer Qwens, for example, can do great work given the right framing up, at much lower cost.
