Most of the best ideas in software engineering aren’t new. They’ve been written up in books, argued over at conferences, taught in every “best practices” deck since the late 1990s. And most teams quietly don’t do them.
Not because anyone thinks they’re wrong. Test-driven development, design by contract, architecture decision records, mutation testing — ask a room of senior engineers whether these are good ideas and you’ll get nods. Ask the same room who practices them consistently under deadline pressure and the hands stay down. I’ve been in that room for twenty-five years, on both sides of the question. I’ve also been the engineering leader who let those practices slip because shipping the feature mattered more this quarter.
There’s a single economic reason these practices lose. The upfront cost is high, the payoff is real but distant, and human attention is the binding constraint. Write the test before the code, document the decision, specify the invariant — every one of those is a tax you pay now against a benefit you collect later, maybe, if the project lives long enough. Under deadline pressure, that’s a losing trade for a human. So we skip it, ship, and pay the interest later in bugs and confusion. Call it the impatience tax.
Agents don’t pay that tax. They have infinite patience for upfront rigor and roughly zero marginal cost for the tedious work that rigor demands. Writing a thorough test suite for code that doesn’t exist yet is psychologically brutal for a person and completely fine for a model. That single shift — the cost of patience going to zero — quietly inverts the economics of a whole list of practices we knew were right and gave up on anyway.
This isn’t a piece about what AI makes possible. Lots of things are possible. It’s about a narrower, more useful question: which disciplines did we already agree were correct, fight about for decades, and abandon for reasons that no longer hold?
Here are nine.
TL;DR
These nine practices share one structure: high upfront cost, distant payoff. Human attention is the constraint that kills them under deadline pressure.
AI removes the constraint. A failing test is the clearest prompt you can hand an agent; a spec is its input; an ADR is its context. The discipline becomes the interface.
TDD, design by contract, and property-based testing turn from “things we should do” into the most effective way to constrain agent behavior and prevent hallucinated correctness.
Documentation, ADRs, and living docs get a bilateral ROI: agents generate them from code, and they make agents far more effective in your codebase.
The catch is real. A 2025 METR randomized trial found experienced developers were about 19% slower with AI assistance. These practices pay off only when AI is used with discipline, not as autocomplete.
1. Test-Driven Development
Start with the practice most teams abandoned first.
Writing tests before code was always the theoretically superior move. It forces you to define the interface before you build behind it, catches bugs at the moment of definition instead of during integration, and leaves behind a living specification of what the code is supposed to do. Kent Beck made the case decades ago and the case held up.
Almost nobody did it consistently. The reason is psychological, not technical. Writing detailed tests for code that doesn’t exist yet, while a deadline breathes down your neck, feels like building scaffolding for a house you haven’t designed. Your brain screams at you to just write the function. So you write the function, promise yourself you’ll add tests after, and, well, you know how that goes.
Now flip the perspective. To an agent, a failing test isn’t scaffolding. It’s the clearest possible specification of intent you can provide. “Make this pass, don’t break anything else” is an unambiguous, machine-checkable instruction, which is exactly what a probabilistic system needs to stay honest. The test suite becomes a guardrail that prevents the most dangerous failure mode in AI-assisted coding: confident, plausible, wrong. Hallucinated correctness dies against a red bar.
TDD went from the discipline most teams couldn’t sustain to one of the best tools we have for bounding what an agent is allowed to claim it did. Same practice. Opposite economics.
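Here is a minimal sketch of what "the failing test is the prompt" can look like. The `slugify` function and its three tests are hypothetical, invented for illustration. The tests are written first and are the entire specification handed to the agent; the implementation underneath is one body that satisfies them, not the only one.

```python
import re


# Step 1 (human): tests written first. Run before slugify exists, they fail,
# and that failure is the specification handed to the agent.
def test_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"


def test_collapses_punctuation_and_whitespace():
    assert slugify("  TDD: red, green, refactor!  ") == "tdd-red-green-refactor"


def test_empty_input_yields_empty_slug():
    assert slugify("") == ""


# Step 2 (agent): one implementation that turns the bar green.
def slugify(text: str) -> str:
    """Lowercase the text and join alphanumeric runs with single hyphens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return "-".join(words)


if __name__ == "__main__":
    for test in (
        test_lowercases_and_hyphenates,
        test_collapses_punctuation_and_whitespace,
        test_empty_input_yields_empty_slug,
    ):
        test()
    print("all tests pass")
```

The instruction to the agent never needs to describe the regex. "Make these three pass, don’t break anything else" is the whole brief, and the result is checkable without trusting the agent’s own summary of what it did.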
2. Spec-Driven Development and Design by Contract
Bertrand Meyer formalized design by contract in the 1980s and built it into the Eiffel language: specify preconditions, postconditions, and invariants, then let the implementation follow from the contract.
The idea was sound and the adoption was thin, for one stubborn economic reason: the contract only pays off if someone else writes the implementation from it. If you’re writing both the spec and the code, the spec is overhead, because you already know what you meant. The contract’s value lives in the handoff, and for most of software history there was no cheap counterpart to hand it to.
Now there is. You write the contract; the agent writes the implementation from it. Spec-driven development stops being a documentation chore and becomes the actual control surface for delegation. The spec is the part requiring human judgment about what the system should do. The implementation — the part that used to eat the hours — is the part you delegate. Meyer’s economics finally close, forty years late, because the missing party in the transaction showed up.
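A rough sketch of that split in Python, which has no built-in contract support the way Eiffel does. The `contract` decorator, the `ContractViolation` exception, and the `withdraw` example are all hypothetical helpers made up for this illustration, not a library API. The human writes the precondition and postcondition; any function body that satisfies them is acceptable.

```python
from functools import wraps
from typing import Callable


class ContractViolation(AssertionError):
    """Raised when a precondition or postcondition does not hold."""


def contract(pre: Callable[..., bool], post: Callable[..., bool]):
    """Attach a precondition (on the arguments) and a postcondition
    (on the result plus the arguments) to a function."""
    def decorate(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if not pre(*args, **kwargs):
                raise ContractViolation(f"precondition violated in {func.__name__}")
            result = func(*args, **kwargs)
            if not post(result, *args, **kwargs):
                raise ContractViolation(f"postcondition violated in {func.__name__}")
            return result
        return wrapper
    return decorate


# The human-authored part: what must hold before and after the call.
@contract(
    pre=lambda balance, amount: 0 < amount <= balance,
    post=lambda result, balance, amount: result == balance - amount and result >= 0,
)
def withdraw(balance: int, amount: int) -> int:
    # The delegated part: any body that satisfies the contract will do.
    return balance - amount
```

Calling `withdraw(100, 30)` returns 70. Calling `withdraw(100, 130)` raises `ContractViolation` before the body runs, which is the point: the contract is enforced at the boundary regardless of who, or what, wrote the implementation.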
3. Architecture Decision Records
Why does the codebase look like this? Why Postgres and not Dynamo, why this queue, why the weird module boundary that everyone trips over?
ADRs were the right answer to that question — a short dated record of each significant decision, the context, and the alternatives rejected. The discipline almost never held. Same shape as everything else here: the cost is immediate (stop, write the thing) and the value accrues slowly, mostly to some future engineer who isn’t in the room yet.
Two things flipped at once, which makes this one more interesting than the rest. First, agents can generate ADRs from an existing codebase — read the git history, the dependency choices, the structure, and reconstruct the decisions that produced them. The retroactive cost of documentation drops toward zero. Second, and this is the part people miss: existing ADRs dramatically improve what an agent can do in your codebase. An agent that can read why you chose eventual consistency won’t keep proposing changes that assume strong consistency.
So the ROI went bilateral. Agents help you write ADRs, and ADRs help agents help you. The practice that used to only cost now pays on both ends.
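To make "a short dated record" concrete, here is one possible shape for it, sketched as a small data structure that an agent could both emit and read back. The field names, the `render` layout, and the sample decision are assumptions for illustration only; they are not a standard ADR format or a record from a real project.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DecisionRecord:
    """One architecture decision: what was decided, when, why, and what lost."""
    number: int
    title: str
    decided_on: date
    context: str
    decision: str
    rejected: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the record as plain text suitable for checking into the repo."""
        lines = [
            f"ADR {self.number:04d}: {self.title}",
            f"Date: {self.decided_on.isoformat()}",
            "",
            "Context",
            self.context,
            "",
            "Decision",
            self.decision,
            "",
            "Alternatives rejected",
        ]
        lines += [f"- {alt}" for alt in self.rejected] or ["- none recorded"]
        return "\n".join(lines)


# Illustrative record only; the decision and date are invented.
adr = DecisionRecord(
    number=7,
    title="Use eventual consistency for the read model",
    decided_on=date(2024, 3, 14),
    context="Reads far outnumber writes and can tolerate brief staleness.",
    decision="Replicate asynchronously; readers may see slightly stale data.",
    rejected=["Strong consistency on every read: too slow for the read path"],
)
print(adr.render())
```

The "Alternatives rejected" section is the part that earns its keep with an agent: it is what stops the same rejected proposal from coming back every time the agent touches that module.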
4. Continuous Code Review
“Catch issues early” has sat on every best-practices list since Extreme Programming put continuous review on the map. The advice was never controversial. The bottleneck was always the same: human reviewer attention is finite, expensive, and easily exhausted. So review collapsed into batch PR review — a tired engineer reading a 600-line diff on a Friday afternoon, approving most of it on faith.
AI review on every commit — not batched at the PR boundary, but running as code lands — is moving from aspirational toward baseline: increasingly the default on teams that have wired it in, though not yet universal. The marginal cost of a careful read went to nearly nothing, and the read happens while the context is still warm.
But this one comes with an emergent problem worth naming, because it bites teams that adopt the tooling without rethinking the model. PRs are getting larger and arriving faster under AI-assisted development. An agent can produce a 2,000-line change in an afternoon. If your review model still routes everything through a human approver at the end, that human is now the rate limiter, drowning in volume they didn’t generate and can’t realistically read. AI review on every commit is part of the answer. The harder part is restructuring what the human reviews — architecture, intent, the decisions a model shouldn’t make alone — and letting the machine handle line-level correctness continuously. Adopt the tool without rethinking the workflow and you’ve just built a faster way to overwhelm your best reviewer.
5. Pair Programming
Pairing always looked expensive in the most obvious way: two engineers, one task, double the salary against a single unit of output. That intuition was wrong on the numbers. The overhead measured in early controlled studies of pairing was closer to 15% extra developer time, often recovered through fewer defects. But the 2× gut feeling is what drove the decisions. The benefits were real (knowledge transfer, real-time review, fewer dumb mistakes), but the perceived math meant most teams reserved it for critical paths, gnarly bugs, or onboarding a new hire. A luxury, rationed.
The pair is now a human and an agent, and it’s available to every engineer continuously, not rationed to the critical path. The knowledge-transfer benefit generalizes — the agent can explain unfamiliar parts of the codebase on demand. The real-time-review benefit generalizes — a second set of eyes on every line, every time, without scheduling two calendars. The economics that made pairing a rationed luxury simply don’t apply when one half of the pair has near-zero marginal cost.
Worth a caveat: agent-as-pair is genuinely good at the review and explanation half of pairing, and weaker at the part where a human partner pushes back on a bad design before you’ve written a line. You still need humans pairing with humans for that. But the day-to-day, line-by-line version of pairing just became free, and that’s most of what pairing was for.
6. Mutation Testing
Coverage numbers lie, and most engineers know it. Eighty percent line coverage tells you eighty percent of your lines got executed by a test — not that any of those tests would notice if the behavior broke. Mutation testing is the honest measure: it deliberately introduces bugs (flip a comparison, drop a line, change a constant) and checks whether your test suite catches them. If a mutant survives, you have a test that runs code without actually validating it.
Mutation testing was always the gold standard and almost never run continuously, for one reason: it’s computationally expensive. You’re effectively running your whole suite many times over, once per mutation. On a real codebase that’s brutal. So it lived in research papers and the occasional heroic CI job that someone eventually disabled for being too slow.
That constraint is mostly gone: compute is cheap and parallel, and we got more comfortable spending it. And the practice became affordable right when we need it most. AI-generated tests have a characteristic failure mode: they drift toward coverage metrics without meaningful assertions. The model writes a test that calls the function, exercises the path, and asserts almost nothing of substance. Green checkmark, zero protection. Coverage looks great. Mutation testing is the thing that catches exactly that. It’s the verification layer for a verification layer, and it matters more now than when it was invented, because now a machine is writing the tests.
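The mechanics fit in a few lines. This is a hand-rolled sketch with one hard-coded mutant, not how a real tool works internally; tools such as mutmut (Python) or PIT (Java) generate and run the mutants for you. The `can_vote` function and both suites are hypothetical examples.

```python
from typing import Callable

Predicate = Callable[[int], bool]


def can_vote(age: int) -> bool:
    """Original implementation: eligible at 18 or older."""
    return age >= 18


def mutant_can_vote(age: int) -> bool:
    """Mutant: the comparison operator was flipped from >= to >."""
    return age > 18


def weak_suite(impl: Predicate) -> bool:
    """Executes every line of the function but never probes the boundary."""
    return impl(30) is True and impl(5) is False


def strong_suite(impl: Predicate) -> bool:
    """Same checks plus the boundary values that distinguish >= from >."""
    return weak_suite(impl) and impl(18) is True and impl(17) is False


def mutant_killed(suite: Callable[[Predicate], bool]) -> bool:
    """A mutant is killed when a suite that passes on the original fails on the mutant."""
    if not suite(can_vote):
        raise ValueError("suite must pass on the original code first")
    return not suite(mutant_can_vote)


if __name__ == "__main__":
    print("weak suite kills mutant:", mutant_killed(weak_suite))      # False
    print("strong suite kills mutant:", mutant_killed(strong_suite))  # True
```

Both suites give `can_vote` full line coverage. Only the second one notices when the behavior at age 18 changes, and a surviving mutant is how you find that out.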
7. Living Documentation
Documentation was always supposed to be a first-class artifact. It almost never was, and the reason is by now familiar: writing docs is tedious, and the penalty for stale docs accrues slowly and lands on someone else. So docs rotted. Every team has a wiki that’s a graveyard of half-true pages from two reorgs ago.
AI changes both halves of the equation at once. It generates docs from code, so the writing cost drops. And it consumes docs as context, so the docs earn their keep immediately — a well-documented codebase is a measurably more useful codebase for an agent working in it. The ROI is immediate and bilateral, same structure as ADRs.
There’s a quiet shift hiding in there. Documentation used to be written for humans who’d mostly never read it. Now it’s also written for the agent that will read it on every task, which means stale docs don’t just confuse a future engineer — they actively degrade your tooling today. The feedback loop tightened from months to minutes. That’s the kind of change that actually moves behavior, because the cost of skipping it shows up now instead of later.
8. Runbook Generation from Incidents
On-call always leaned too hard on tribal knowledge. The person who knows why the payment service wedges at 3 a.m. is asleep, on vacation, or left the company last spring. Writing a runbook after each incident was obviously the right move and reliably the thing nobody did, because the incident was over and everyone wanted to go back to bed.
Incidents become runbooks automatically now. The agent has the incident timeline, the chat transcript, the commands that resolved it, the postmortem — and it can turn that into a structured runbook while the details are fresh, without asking an exhausted engineer to relive the night. The cost that used to fall right when motivation was lowest now falls on a system that doesn’t get tired or resentful.
I’d treat the generated runbook as a draft a human still signs off on, not gospel. But “imperfect draft, reviewed in five minutes” beats “blank page nobody ever fills in,” and that was always the real competition.
9. Property-Based Testing
Example-based tests check the cases you thought of. Property-based testing is stronger: you specify the invariants a system must always satisfy (reversing a list twice returns the original, a serialized-then-deserialized object equals the original, the account balance never goes negative) and the framework generates hundreds of randomized inputs trying to break them. QuickCheck pioneered the approach; it finds the edge cases you’d never have written by hand.
It never went mainstream outside a few communities, and the bottleneck wasn’t tooling — good property-based libraries exist for most languages. The bottleneck was writing good property specifications. Identifying the right invariants requires deep domain reasoning: you have to understand the system well enough to state what must always be true, which is harder than writing a few example cases. Most engineers, under pressure, defaulted to the easier thing.
This is where AI helps in a way that’s less obvious than “it writes the code.” A model can generate property suites from a spec, and — more usefully — it can reason about what invariants a system should satisfy in the first place, surfacing properties you hadn’t articulated. That’s the expensive, judgment-heavy part it actually offloads. Combined with mutation testing to keep the generated properties honest, you get a testing approach that was always more powerful than example-based testing and was always too expensive in human reasoning to adopt widely.
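For a feel of the mechanics, here is a standard-library-only sketch built on two of the invariants mentioned above. The `check` runner and the generator are hypothetical stand-ins written for this example; a real library such as Hypothesis adds smarter input generation and shrinks failures to a minimal counterexample. The third property is deliberately false, to show what a caught violation looks like.

```python
import json
import random
from typing import Callable, Optional


def random_int_list(rng: random.Random) -> list[int]:
    """Generate a list of 0 to 20 integers, including negatives and duplicates."""
    return [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 20))]


def prop_reverse_twice_is_identity(xs: list[int]) -> bool:
    return list(reversed(list(reversed(xs)))) == xs


def prop_json_round_trip(xs: list[int]) -> bool:
    return json.loads(json.dumps(xs)) == xs


def prop_sorted_is_noop(xs: list[int]) -> bool:
    """Deliberately wrong: claims sorting never changes a list."""
    return sorted(xs) == xs


def check(prop: Callable[[list[int]], bool],
          trials: int = 500, seed: int = 0) -> Optional[list[int]]:
    """Run prop against many generated inputs.

    Returns the first counterexample found, or None if every trial passed.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        xs = random_int_list(rng)
        if not prop(xs):
            return xs
    return None


if __name__ == "__main__":
    print("reverse twice:", check(prop_reverse_twice_is_identity))  # None
    print("json round trip:", check(prop_json_round_trip))          # None
    print("sorted is a no-op:", check(prop_sorted_is_noop))         # an unsorted list
```

The runner is trivial. The judgment lives in the `prop_*` functions, which is exactly the part the paragraph above argues a model can help draft and a human should still review.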
The Catch
I’d be selling you something if I stopped there, and the data won’t let me.
In July 2025, METR published a randomized controlled trial with experienced open-source developers working on real tasks in repositories they knew well. The developers expected AI assistance to speed them up. It slowed them down, by roughly 19%. METR’s February 2026 follow-up found that gap narrowing, and reversing on some measures, as the same kind of developers gained real experience with the tools. In other words, the 19% was a snapshot of the unfamiliar, undisciplined path, and it closes as people pick up the habits this piece is about.
That finding is real and it isn’t a contradiction of everything above. It’s the missing condition. Every practice in this piece works because it imposes structure on the agent — TDD as a guardrail, the spec as input, the ADR as context, mutation testing as the check on the check. Used that way, with discipline, AI is constrained toward correctness. Used the other way — as autocomplete, as a vibe-coding partner you don’t supervise — you get more code, faster, with less correctness and a slower path to done once you account for the cleanup. The METR developers, working in code they already understood deeply, may well have been paying exactly that tax: accepting plausible suggestions that took longer to vet and fix than writing it themselves would have.
So the inversion isn’t automatic. The cost of patience dropped to zero, which makes the rigorous path finally affordable. It does not make the undisciplined path good. If anything it makes discipline more important, because a tool that produces plausible output at high volume is precisely the tool that most needs a guardrail you can’t talk your way past. A red test bar doesn’t care how confident the model sounds.
What To Do With This
The interesting question was never “what does AI make possible.” That list is enormous, mostly speculative, and not very actionable. The better question is the one this whole piece is built on: which practices did we already know were right, argue about for decades, and quietly give up on?
That list is short, specific, and yours to write. Go pull your own team’s “we should really do this but we don’t” backlog — the standing items in retros that everyone agrees with and nobody owns. I’d bet most of them have the same economic shape: high upfront cost, distant payoff, killed by human impatience under deadline. Test coverage on the legacy module. The runbooks. The ADRs for the three decisions everyone keeps re-litigating. The integration tests that would’ve caught last quarter’s outage.
Run each one through a single question: was this abandoned because it was wrong, or because it was expensive in human patience? The wrong ones, leave abandoned. The expensive-in-patience ones just got cheap. Those are the ones to pick back up first.
The impatience tax got repealed. The disciplines it used to make unaffordable are sitting right there, mostly unchanged, waiting for someone to notice the price changed.
Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.