Stop Reviewing Agent Code. Start Verifying It.
By Ptrck Brgr
Tests used to be optional. Extra work you'd skip under deadline pressure, maintain grudgingly, delete quietly when they broke. Agents changed the math. Now tests are the only thing between you and blind trust in code nobody reads.
In Engineering practices that make coding agents work, Simon Willison argues that the path to trusting agent output isn't better models. It's automated proof: TDD, runtime checks, conformance suites, and sandboxes that limit blast radius.
At ENVAIO, we shipped IoT firmware where a bad deploy meant physical device recalls—not a reverted PR. Verification was never theoretical. Willison is making the same argument for a different domain: if you can't prove the code works, the speed is a liability.
The Exhaustion Problem
That tires you out, right? We become full-time code reviewers and that's an exhausting sort of state of the world. — Simon Willison
Agents generate fast. Humans review slow. The bottleneck moves from writing to reviewing, and suddenly your best engineers spend their days reading AI output instead of building.
Willison's answer is blunt: stop reading, start verifying.
Red-Green TDD (Yes, Really)
Willison admits he hated TDD throughout his career—found it tedious. With agents? He doesn't care if the machine spins for a few extra minutes writing a failing test first.
I think I see people who are writing code with coding agents and they're not writing any tests at all. That's a terrible idea. — Simon Willison
His workflow: tell the agent how to run tests, say "use red-green TDD," let it churn. Five tokens of instruction. Code that proves itself as it's written—not code waiting for a human to validate later.
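The red-green loop itself fits in a handful of lines. This is a minimal sketch of the pattern the agent is asked to follow; `slugify` is a hypothetical function invented for illustration, not something from the talk.

```python
# Red phase: the test exists before the implementation. Run at this
# point, it fails with a NameError -- that failure is the "red".
def test_slugify():
    assert slugify("Hello World") == "hello-world"
    assert slugify("  Trim Me  ") == "trim-me"

# Green phase: the minimal implementation that makes the test pass.
def slugify(title: str) -> str:
    return "-".join(title.lower().split())

test_slugify()  # now passes: the "green"
```

The point isn't the toy function. It's the ordering: the failing test is written first, so the code proves itself the moment it exists.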
Here's what caught me off guard. Willison claims tests are "effectively free" now because agents write them without complaint. The old cost-benefit calculation collapses when generation costs approach zero. I don't think most teams have internalized this yet. The habit of skipping tests is deeply embedded, and "the agent will write them" requires trust in the agent's test quality that itself needs verification (and yes, that's recursive—still working through what that means in practice).
Beyond the Test Suite
Passing tests is necessary. Insufficient.
Anyone who's shipped production code knows the failure mode: the test suite passes, but the server won't boot. So Willison tells agents to start the server, use curl to exercise the API, and verify actual runtime behavior. This often catches bugs the tests missed entirely.
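The same check works without curl. Here's a self-contained stand-in for "start the server, then hit it": serve a directory on an OS-assigned port and assert on the real HTTP response, not on unit-test internals. The stdlib server is a placeholder for whatever application server your project actually runs.

```python
import http.server
import threading
import urllib.request

# Boot a real server on an OS-assigned port (port 0 avoids conflicts).
server = http.server.HTTPServer(
    ("127.0.0.1", 0), http.server.SimpleHTTPRequestHandler
)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Exercise it over actual HTTP, the way curl would.
resp = urllib.request.urlopen(f"http://127.0.0.1:{port}/")
assert resp.status == 200, "server booted but did not answer"

server.shutdown()
```

A unit test can pass while this check fails, which is exactly why it earns its place.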
He built a tool called Showboat—48 hours old at the talk—that generates Markdown transcripts of these API exercises. Curl command, output, next endpoint, output. An auditable record.
Then there's conformance-driven development. He asked Claude to build a multipart file upload test suite that passes across Go, Node.js, Django, Starlette, and two others. Six frameworks, one test suite, reverse-engineered from implementations. Then coded his own framework against that suite.
Unit tests prove correctness. Smoke tests prove the system runs. Conformance suites prove interoperability. Each layer catches what the previous one misses.
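The conformance idea shrinks to a few lines: one suite, multiple independent implementations, every implementation held to the same cases. Here the behavior under test is URL percent-encoding; `my_quote` is a hypothetical second implementation standing in for "your own framework," while the talk's real example was a multipart-upload suite across six web frameworks.

```python
from urllib.parse import quote as stdlib_quote

# Characters left unescaped, matching stdlib quote's default safe set.
SAFE = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.-~/"
)

def my_quote(s: str) -> str:
    """Hypothetical independent implementation of percent-encoding."""
    return "".join(c if c in SAFE else "%{:02X}".format(ord(c)) for c in s)

# One shared suite of conformance cases.
CASES = [("a b", "a%20b"), ("x&y", "x%26y"), ("path/to", "path/to")]

def run_conformance(impl):
    for given, expected in CASES:
        assert impl(given) == expected, f"{impl.__name__} failed on {given!r}"

# Every implementation must pass the same suite.
for impl in (stdlib_quote, my_quote):
    run_conformance(impl)
```

Reverse-engineer the suite from implementations that already agree, then build yours against it.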
The question I can't fully resolve: where does observability fit? Tests and curl cover pre-deploy. What happens when real traffic hits edge cases no test anticipated? Willison doesn't address monitoring, and that gap nags at me.
The Lethal Trifecta
Willison named "prompt injection" after SQL injection—and later called it a bad choice of name. His second attempt stuck better: the lethal trifecta. A model with access to private data, exposed to malicious instructions, and connected to an exfiltration vector. Cut any one leg and the attack surface collapses.
The most important thing is sandboxing. You want your coding agent running in an environment where if something goes completely wrong... the damage is greatly limited. — Simon Willison
He runs Claude Code for the Web in Anthropic's containers. Worst case? Someone destroys their VM. Click a button, get a new one. On his Mac, he runs with `--dangerously-skip-permissions` despite being, in his words, "the world's foremost expert on why you shouldn't do that."
I respect that honesty more than the usual security theater. But here's the catch: enterprise environments can't hand-wave data access. When agents touch customer data and production API keys, sandboxing isn't a checkbox—it's an architecture decision. The teams I've seen navigate this well treat it as a design constraint, not an afterthought.
Code Quality Is a Choice
Does code quality matter when agents write everything?
Context-dependent. Throwaway HTML tools? 800 lines of spaghetti, who cares. Long-lived systems? Quality matters more, not less.
Willison frames it sharply: if the agent outputs 2,000 lines of bad code and you ignore it, that's on you. Feed back refactoring instructions and you get code better than what you'd write by hand—because you'd skip that last hour of cleanup, but the agent won't.
He uses cookiecutter templates to set patterns before the agent starts. Agents follow existing patterns almost perfectly. Clean codebases amplify agents.
Why This Matters
TDD, smoke tests, conformance suites, sandboxing—none of these are new. What's new is that skipping them went from risky to reckless when agents started writing most of the code.
Willison reports a 49% Fibonacci speedup from a single prompt to Opus 4.6. He's one-shotting three new RSS feeds on his phone between conversations. The capability is real. The organizations that build the verification stack first won't just move faster. They'll be the only ones who can trust what they ship.
What Works
Start every agent session with test instructions. "Use red-green TDD" is five tokens. The ROI is disproportionate.
Go beyond unit tests. Start the server, hit the API with curl. Automated tests pass in isolation. Production runs in context.
Sandbox by default. Containers, VMs, whatever reduces blast radius. Assume the model will execute something destructive, and limit the damage.
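One way to make "reduce blast radius" concrete is to wrap the agent's commands in a container with the dangerous capabilities switched off. This sketch assumes Docker; the `sandboxed` helper is invented for illustration, and the flags shown (`--network=none`, `--read-only`, `--memory`) are standard `docker run` options.

```python
import shutil
import subprocess

def sandboxed(cmd, image="python:3.12-slim", workdir="/work"):
    """Build a docker argv that runs `cmd` with no network, a read-only
    root filesystem, and a memory cap -- limiting blast radius."""
    return [
        "docker", "run", "--rm",
        "--network=none",   # cuts the exfiltration vector
        "--read-only",      # immutable root filesystem
        "--memory=512m",    # bounds resource use
        "--workdir", workdir,
        image, *cmd,
    ]

argv = sandboxed(["python", "-c", "print('hello')"])
# Only execute when docker is actually available on this machine.
if shutil.which("docker"):
    subprocess.run(argv, check=False)
```

Cutting the network leg alone collapses the lethal trifecta: private data plus malicious instructions goes nowhere without an exfiltration path.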
Invest in templates and clean project structure. Agents mirror what they find. One well-written test shapes every test that follows.
Caveat: this works when you have engineers who know what good looks like. Tests written by agents need someone who can judge whether they test the right things. Speed without judgment is still debt—just with a fancier origin story.
Full talk: Watch on YouTube