Anthropic's 150% Productivity Claim—And Why PRs Tell Half the Story

By Ptrck Brgr

Pull requests up 150%. But what does that actually mean?

In Inside Claude Code With Its Creator Boris Cherny, Cherny lays out the product philosophy that's reshaping how AI-native tools get built. The core heuristic: don't build for today's model. Build for the model six months from now. Everything else—the UI, the scaffolding, the clever features—is disposable.

Here's the part that made me pause. Boris calls PR count "the simplest stupidest measure" of productivity. He's measuring it anyway. At Anthropic, engineer productivity is up 150% since the tool launched—tracked primarily by pull requests, cross-checked with commits and "lifetime of commits." The number is real. But I've watched teams hit similar metrics while defect rates climbed and review load exploded. The correlation between PRs and actual value? Still an open question—and one I keep coming back to.

Why Scaffolding Dies

At Anthropic, the way that we thought about it is we don't build for the model of today. We build for the model six months from now. — Boris Cherny

This changes how you prioritize everything. Traditional product development optimizes for today's constraints. AI-native development optimizes for constraints that haven't arrived yet.

The practical implication: features that feel essential today become dead weight in three months. Boris is explicit about this trade-off—you can build scaffolding for a 10-20% improvement, or wait a couple months for the model to do it natively. Most of that scaffolding gets "wiped out with the next model."

There's no part of the codebase that was around six months ago. It's constantly rewritten.

The Latent Demand Principle

Probably the single for me biggest principle in product is latent demand. — Boris Cherny

Watch what users already do. Productize that.

Users were creating markdown instruction files before the feature existed. They were asking the model to plan before writing code. They were copying shell commands into conversations manually. Each became a feature—not because the team invented it, but because they observed the behavior and made it frictionless.

This is a fundamentally different approach than most product teams take. Most teams brainstorm features, then convince users to adopt them. Boris describes the inverse: users show you what they want through their workarounds, and your job is to notice.

At Tier, we spent weeks building features we thought users needed. The ones that stuck? Almost always things users had already hacked together on their own. The pattern repeats.

What Makes a Feature "Real"?

Plan mode—there's no big secret to it. All it does is it adds one sentence to the prompt that's like please don't code. — Boris Cherny

This one surprised me. Plan mode—a feature that felt like sophisticated orchestration—is literally one sentence in the system prompt.
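
For a sense of how little machinery that involves, here's a minimal sketch using the Anthropic Python SDK: one base system prompt, plus one optional extra sentence. The prompt wording, the helper name, and the model id are my illustrations, not Claude Code's actual internals.

```python
# pip install anthropic
# Minimal sketch of "plan mode" as a single extra sentence in the system prompt.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

BASE_SYSTEM = "You are a coding assistant working in the user's repository."
PLAN_SENTENCE = "Please don't write code yet; outline a plan first."  # illustrative wording

def ask(task: str, plan_mode: bool = False) -> str:
    """Send a task, optionally with the one-sentence plan-mode addition."""
    system = BASE_SYSTEM + (" " + PLAN_SENTENCE if plan_mode else "")
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # substitute whatever current model you use
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": task}],
    )
    return response.content[0].text

if __name__ == "__main__":
    print(ask("Add retry logic to the HTTP client", plan_mode=True))
```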

And here's the kicker: Boris thinks plan mode has a limited lifespan. "Maybe in a month. No more need for plan mode in a month." The model will get good enough to figure out when to plan versus when to code without explicit instruction.

The question I keep asking: if the killer features are this simple, where does the real leverage come from? My read (and I could be wrong here): it's not in the features. It's in the iteration speed and the willingness to delete what you just built.

Swarms That Ship

The first kind of big example where it worked is our plugins feature was entirely built by a swarm over a weekend. It just ran for like a few days. There wasn't really human intervention. — Boris Cherny

An engineer gave the agent a spec and an Asana board. The main agent spawned sub-agents, posted tasks, and the sub-agents picked them up. The plugins feature shipped. A few days. No human intervention.

I've seen plenty of multi-agent demos. Most are theater—impressive coordination that falls apart outside controlled environments. This is different: a real production feature, built autonomously, shipped to users. The gap between demo and deployment is closing faster than I expected.

The pattern they describe—"uncorrelated context windows"—treats multiple agents with fresh contexts as a form of test-time compute. Each agent brings unpolluted reasoning to the problem. It's brute force, but it works.
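
A rough sketch of that pattern, assuming nothing about Anthropic's actual swarm setup: fan one task out to several agents that each start from an empty context, then compare what comes back. The run_agent stub below stands in for a real agent or API call.

```python
# Sketch of "uncorrelated context windows": the same task run through several
# independent attempts, each with a fresh context, as a form of test-time compute.
from concurrent.futures import ThreadPoolExecutor

def run_agent(task: str, attempt: int) -> str:
    """Placeholder for one agent run that starts with no shared history.

    In practice this would open a brand-new conversation with your agent of
    choice, so each attempt reasons independently of the others.
    """
    return f"[attempt {attempt}] proposed solution for: {task}"

def swarm(task: str, n: int = 4) -> list[str]:
    """Fan the task out to n uncorrelated attempts and collect the results."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(lambda i: run_agent(task, i), range(n)))

if __name__ == "__main__":
    for result in swarm("Implement the plugins feature described in SPEC.md"):
        print(result)
    # A judge step (human review or another model call) picks the best attempt;
    # that selection is where the extra compute pays off.
```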

The Measurement Problem

Here's where I push back a bit. PR count is a proxy. A useful proxy—but it can mask the things that matter.

At enterprise scale, I've watched PR velocity climb while code quality dropped and rework tripled. The math only works if you're also tracking defect rates, review load, cycle time, and—most importantly—whether the features actually shipped value to users.
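
As a concrete illustration of what that broader scorecard could look like, here's a small sketch with hypothetical fields you'd pull from your VCS, issue tracker, and incident tooling; none of it reflects how Anthropic actually measures.

```python
# Hedged sketch: report PR count alongside the context that makes it meaningful.
from dataclasses import dataclass
from datetime import timedelta
from statistics import mean

@dataclass
class PullRequest:
    cycle_time: timedelta    # opened-to-merged
    review_comments: int     # rough proxy for review load
    caused_defect: bool      # linked to a post-merge bug or incident

def summarize(prs: list[PullRequest]) -> dict[str, float]:
    """PR count on its own versus PR count with quality signals attached."""
    return {
        "pr_count": len(prs),
        "defect_rate": sum(pr.caused_defect for pr in prs) / len(prs),
        "avg_review_comments": mean(pr.review_comments for pr in prs),
        "avg_cycle_time_hours": mean(
            pr.cycle_time.total_seconds() / 3600 for pr in prs
        ),
    }

if __name__ == "__main__":
    sample = [
        PullRequest(timedelta(hours=6), review_comments=3, caused_defect=False),
        PullRequest(timedelta(hours=30), review_comments=12, caused_defect=True),
        PullRequest(timedelta(hours=2), review_comments=1, caused_defect=False),
    ]
    print(summarize(sample))
```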

Boris acknowledges this: "simplest stupidest measure." The honesty is refreshing. But for teams trying to replicate these results, the risk is measuring activity and calling it productivity.

The questions I'd want answered: what happened to incident rates? To time-to-resolve? To the code that didn't need to be written in the first place because requirements were clearer? My sample size is limited to enterprise contexts, but the pattern I keep seeing is velocity gains that vanish when you look at net throughput.

The CLAUDE.md Question

If you hit this, my recommendation would be delete your CLAUDE.md and just start fresh. — Boris Cherny

Contrarian advice that I'm still wrestling with. The tool works best with minimal instructions. Too many rules cause the model to drift into over-compliance—doing exactly what you said instead of what you meant.

But here's my concern (and this might just be enterprise brain talking): in regulated environments, you need guardrails. Compliance requirements, security constraints, repeatability for audits. "Minimal prompts" might optimize for velocity while creating governance headaches downstream.

The answer probably isn't one-size-fits-all. Consumer tools can afford to be minimal. Enterprise workflows need more structure—even if that structure costs some capability.

Why This Matters

The economics of AI tooling are inverting. Features that used to require months of engineering can be rendered obsolete by a model update. The teams that win aren't the ones building the most sophisticated scaffolding—they're the ones iterating fastest and deleting boldly.

The 150% productivity gain is real. But the metric itself reveals the gap between what we can measure and what we actually care about. PR count tells you something. It doesn't tell you everything.

The deeper insight: AI-native tools should be built around latent demand and designed for obsolescence. If you're optimizing for today's model, you're already behind.

What Works

Build for the model six months out. Your current feature set is temporary scaffolding—treat it that way.

Watch what users already do. The best features aren't invented; they're observed and productized.

Keep instructions minimal. Add rules only when the model demonstrably drifts.

Measure outputs, not just activity. PR count is a starting point, not an answer. Track quality alongside velocity.

Expect constant rewrites. If your codebase has code from six months ago, you're probably moving too slowly—or the models haven't improved as fast as expected.

This works when models keep improving at the current pace. If capability gains slow down, the "build for the future model" heuristic breaks. Know which scenario you're betting on.

Full talk: Watch on YouTube