Proving AI ROI in Software Engineering
Author: Ptrck Brgr
Pull request count up 14%. Code quality down 9%. Rework tripled. Effective output? Flat. That's AI without measurement discipline.
Yegor Denisov-Blanch of Stanford studied 120,000 developers for "Can you prove AI ROI in Software Engineering?" The finding: usage quality and codebase hygiene determine outcomes, not activity metrics. Full discussion: https://www.youtube.com/watch?v=JvosMkuNxF8.
Codebase hygiene separates AI wins from AI disasters. Teams with strong tests, typing, and modularity see sustained gains. Those deploying AI into technical debt watch quality metrics tank—PR velocity spikes while defect rates climb until leadership pulls the plug.
The Usage Quantity Trap
120,000 developers studied. Result: how much AI you use predicts little. Token consumption? Loose correlation at best. At moderate usage levels, output can even drop: a "death valley" where teams over-rely on AI without discipline.
AI usage quality matters more than AI usage quantity. — Yegor Denisov-Blanch, Stanford
Quality, not quantity. How you use AI matters. How much? Nearly irrelevant.
Codebase Hygiene: The Hidden Multiplier
Clean codebases amplify AI. Tests, typing, documentation, modularity—all multiply gains. Technical debt? AI accelerates the entropy. Messy codebases get messier faster.
Clean environments multiply AI output. Dirty ones multiply waste.
When Not to Use AI
Knowing when to skip AI matters more than knowing how to use it. Misapplication destroys trust faster than good application builds it.
Example: team adopts AI, PRs up 14%. Sounds like success. Code quality down 9%. Rework up 3x. Effective output flat.
More PRs doesn't mean better. — Yegor Denisov-Blanch, Stanford
Activity metrics lie. Output metrics tell the truth. Most organizations measure activity.
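To make the arithmetic concrete, here is a hypothetical back-of-the-envelope calculation of how a 14% PR bump can still leave effective output flat or worse once rework and quality are priced in. The discounting scheme and the numbers are illustrative assumptions, not the Stanford study's model.

```python
# Hypothetical illustration: why a 14% PR bump can mean flat-or-worse effective output.
# The discounting scheme below is an assumption, not the Stanford study's model.

def effective_output(pr_count: float, rework_share: float, quality_factor: float) -> float:
    """Discount raw throughput by the share of rework and by a quality multiplier."""
    productive_prs = pr_count * (1.0 - rework_share)
    return productive_prs * quality_factor

baseline = effective_output(pr_count=100, rework_share=0.05, quality_factor=1.00)
after_ai = effective_output(pr_count=114, rework_share=0.15, quality_factor=0.91)  # 3x rework, -9% quality

print(f"baseline: {baseline:.1f}")   # 95.0
print(f"after AI: {after_ai:.1f}")   # 88.2, no better than baseline despite +14% PRs
```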
Sophistication Disparity
Equal access doesn't create equal value. Some teams use AI with ad hoc prompts. Others build orchestrated workflows with verification and quality gates.
The gap in outcomes? Massive. Same tool, different discipline, completely different results.
Track engineering outcomes with quality guardrails. Business KPIs alone hide the degradation until it's irreversible.
Technical Considerations
- Codebase hygiene as multiplier – Clean architecture, tests, and documentation boost AI utility
- Usage telemetry – Track how AI is applied, not just how often
- Guardrail metrics – Monitor rework, risk, and DevOps health to catch hidden degradation
- Context-aware deployment – Avoid AI in tasks prone to rejection or excessive rewrite
- Retroactive analysis – Use git history and API logs to measure past patterns (see the sketch after this list)
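As the retroactive-analysis item above suggests, git history alone can yield a crude rework proxy: weekly added versus deleted lines, i.e. churn. A minimal sketch; the repo path, the weekly bucketing, and churn-as-rework are illustrative assumptions, not the study's methodology.

```python
# Minimal sketch: approximate rework as weekly churn mined from git history.
# Repo path, weekly buckets, and churn-as-rework proxy are illustrative assumptions.
import subprocess
from collections import defaultdict
from datetime import datetime

def weekly_churn(repo_path: str) -> dict:
    """Sum lines added/deleted per ISO week from `git log --numstat`."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat", "--pretty=format:@%ad", "--date=short"],
        capture_output=True, text=True, check=True,
    ).stdout
    stats = defaultdict(lambda: [0, 0])  # week -> [added, deleted]
    week = None
    for line in log.splitlines():
        if line.startswith("@"):                       # commit header: "@YYYY-MM-DD"
            week = datetime.fromisoformat(line[1:]).strftime("%G-W%V")
        elif line.strip() and week:                    # numstat line: "added<TAB>deleted<TAB>path"
            added, deleted, _path = line.split("\t")
            if added.isdigit() and deleted.isdigit():  # skip binary files ("-")
                stats[week][0] += int(added)
                stats[week][1] += int(deleted)
    return stats

for week, (added, deleted) in sorted(weekly_churn(".").items()):
    ratio = deleted / added if added else 0.0
    print(f"{week}: +{added} -{deleted}  churn ratio {ratio:.2f}")
```

AI-tool API logs, where available, can be joined onto the same weekly buckets to compare churn before and after adoption.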
Business Impact & Strategy
- Prevent false positives – Avoid misreading activity spikes as productivity gains
- Target high-value contexts – Focus AI on tasks with proven ROI lift
- Address adoption disparity – Close gaps between low- and high-sophistication teams
- Accelerate course correction – Use retroactive measurement instead of long experiments
- Preserve trust – Maintain quality to sustain confidence in AI outputs
Key Insights
- AI productivity gains vary widely across teams
- Quantity of usage is a weak predictor; quality matters more
- Clean codebases amplify AI benefits; technical debt erodes them
- Misapplied AI increases rework and reduces trust
- Equal access doesn’t ensure equal adoption or sophistication
- Outcome-driven metrics with guardrails outperform activity metrics
Why This Matters
AI ROI depends on what you measure. Activity metrics—PR counts, commit frequency—make failure look like success. Quality metrics—rework rates, defect density, review cycles—show the real picture.
Most organizations measure activity. They celebrate velocity while quality collapses. The rework crisis arrives months later, when cause and effect have separated.
Codebase hygiene isn't optional. AI reflects your codebase back at you, amplified. Clean code gets cleaner. Messy code gets unmaintainable. Fix the foundation before deploying AI, or watch technical debt accelerate.
Actionable Playbook
- Assess and improve codebase hygiene: Boost tests, typing, and modularity; track cleanliness index
- Map AI usage patterns: Use telemetry to distinguish low-level from orchestrated workflows
- Implement ROI framework: Primary engineering output metric plus guardrails for quality, rework, and operational health (see the sketch after this playbook)
- Target AI deployment: Train teams to match tasks with AI's strengths
- Retroactively measure: Use git and API data to refine strategy
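The ROI-framework item above could look roughly like this: a single primary output metric whose gains only count when quality guardrails hold. Metric names, values, and thresholds are placeholders, not recommended targets.

```python
# Sketch of an ROI check: one primary output metric, gated by quality guardrails.
# Metric names, values, and thresholds are placeholder assumptions.
from dataclasses import dataclass

@dataclass
class Guardrail:
    name: str
    value: float
    threshold: float
    higher_is_worse: bool = True

    def breached(self) -> bool:
        return self.value > self.threshold if self.higher_is_worse else self.value < self.threshold

def ai_roi_verdict(output_delta_pct: float, guardrails: list) -> str:
    """Only claim a productivity gain when every guardrail is healthy."""
    breaches = [g.name for g in guardrails if g.breached()]
    if breaches:
        return f"no claim: guardrails breached ({', '.join(breaches)})"
    return f"credible gain: +{output_delta_pct:.1f}% effective output, guardrails healthy"

print(ai_roi_verdict(
    output_delta_pct=14.0,
    guardrails=[
        Guardrail("rework rate", value=0.15, threshold=0.08),
        Guardrail("defect density", value=1.2, threshold=1.0),
        Guardrail("change failure rate", value=0.11, threshold=0.15),
    ],
))  # prints "no claim: ..." because rework and defects exceed their thresholds
```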
What Works
Measure outcomes, not activity. Rework rates, defect density, review time. Not PR counts or commit frequency.
Assess codebase hygiene first. Tests, typing, modularity. AI amplifies whatever exists. Start with a mess, end with a bigger mess.
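One way to make "assess hygiene first" measurable: a simple weighted index over a few repo signals. The signals, weights, and the 300-line module target below are invented for illustration; a real assessment would pull from coverage tooling, type-checker reports, and dependency analysis.

```python
# Hypothetical codebase hygiene index: weighted blend of a few repo signals.
# Signals, weights, and the 300-line module target are illustrative assumptions.

def hygiene_index(test_coverage: float, typed_ratio: float, avg_module_loc: float, doc_ratio: float) -> float:
    """Return a 0-1 score; higher suggests a cleaner base for AI-assisted work."""
    size_score = min(1.0, 300.0 / max(avg_module_loc, 1.0))  # penalize oversized modules
    return round(0.4 * test_coverage + 0.3 * typed_ratio + 0.2 * size_score + 0.1 * doc_ratio, 2)

print(hygiene_index(test_coverage=0.72, typed_ratio=0.55, avg_module_loc=420, doc_ratio=0.30))  # 0.63
```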
Map usage patterns with telemetry. Who uses AI how? Sophistication variance determines outcome variance. Close the gap between best and worst practices.
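A minimal sketch of that mapping: bucket AI-tool telemetry events per developer into rough sophistication tiers. The event schema and the tier rules are assumptions for illustration, not a standard.

```python
# Sketch: classify a developer's AI usage into rough sophistication tiers.
# The event schema ("kind" field) and the tier rules are illustrative assumptions.
from collections import Counter

def sophistication_tier(events: list) -> str:
    """Tier a developer from hypothetical AI-tool telemetry events."""
    kinds = Counter(e["kind"] for e in events)
    if kinds["workflow_run"] and kinds["verification_step"]:
        return "orchestrated"        # multi-step workflows with verification gates
    if kinds["prompt"] >= 20:
        return "heavy ad hoc"        # lots of prompting, no structure
    return "light ad hoc"

events = [{"kind": "prompt"}] * 12 + [{"kind": "workflow_run"}, {"kind": "verification_step"}]
print(sophistication_tier(events))   # orchestrated
```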
Track quality guardrails continuously. Activity metrics spike immediately. Quality metrics lag. By the time rework shows up in data, you've shipped months of problematic code.
Train teams on when not to use AI. Misapplication in critical paths destroys trust. Selective application builds it.
This works when you have baseline measurement infrastructure. Without instrumentation, you're celebrating velocity while quality tanks silently. The cost shows up quarters later, too late to course-correct easily.
Full discussion: https://www.youtube.com/watch?v=JvosMkuNxF8.