Proving AI ROI in Software Engineering
Author: Ptrck Brgr
Pull request count up 14%. Code quality down 9%. Rework tripled. Effective output? Flat. That's AI without measurement discipline.
Yegor Denisov-Blanch of Stanford studied 120,000 developers for "Can you prove AI ROI in Software Engineering?" The finding: usage quality and codebase hygiene determine outcomes, not activity metrics. Full discussion: https://www.youtube.com/watch?v=JvosMkuNxF8.
Codebase hygiene separates AI wins from AI disasters. Teams with strong tests, typing, and modularity see sustained gains. Those deploying AI into technical debt watch quality metrics tank—PR velocity spikes while defect rates climb until leadership pulls the plug.
The Usage Quantity Trap
120,000 developers studied. Result: how much AI you use predicts little. Token consumption? Loose correlation at best. At moderate usage levels, output can even drop: a "death valley" where teams over-rely on AI without discipline.
AI usage quality matters more than AI usage quantity. — Yegor Denisov-Blanch, Stanford
Quality, not quantity. How you use AI matters. How much? Nearly irrelevant.
Codebase Hygiene: The Hidden Multiplier
Clean codebases amplify AI. Tests, typing, documentation, modularity—all multiply gains. Technical debt? AI accelerates the entropy. Messy codebases get messier faster.
Clean environments multiply AI output. Dirty ones multiply waste.
When Not to Use AI
Knowing when to skip AI matters more than knowing how to use it. Misapplication destroys trust faster than good application builds it.
Example: team adopts AI, PRs up 14%. Sounds like success. Code quality down 9%. Rework up 3x. Effective output flat.
More PRs doesn't mean better. — Yegor Denisov-Blanch, Stanford
Activity metrics lie. Output metrics tell the truth. Most organizations measure activity.
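To make the arithmetic concrete, here is a hypothetical back-of-the-envelope calculation of how a 14% PR bump can still leave effective output flat or worse once rework and quality are priced in. The discounting scheme and the numbers are illustrative assumptions, not the Stanford study's model.

```python
# Hypothetical illustration: why a 14% PR bump can mean flat-or-worse effective output.
# The discounting scheme below is an assumption, not the Stanford study's model.

def effective_output(pr_count: float, rework_share: float, quality_factor: float) -> float:
    """Discount raw throughput by the share of rework and by a quality multiplier."""
    productive_prs = pr_count * (1.0 - rework_share)
    return productive_prs * quality_factor

baseline = effective_output(pr_count=100, rework_share=0.05, quality_factor=1.00)
after_ai = effective_output(pr_count=114, rework_share=0.15, quality_factor=0.91)  # 3x rework, -9% quality

print(f"baseline: {baseline:.1f}")   # 95.0
print(f"after AI: {after_ai:.1f}")   # 88.2, no better than baseline despite +14% PRs
```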
Sophistication Disparity
Equal access doesn't create equal value. Some teams use AI with ad hoc prompts. Others build orchestrated workflows with verification and quality gates.
The gap in outcomes? Massive. Same tool, different discipline, completely different results.
Track engineering outcomes with quality guardrails. Business KPIs alone hide the degradation until it's irreversible.
Technical Considerations
- Codebase hygiene as multiplier – Clean architecture, tests, and documentation boost AI utility
- Usage telemetry – Track how AI is applied, not just how often
- Guardrail metrics – Monitor rework, risk, and DevOps health to catch hidden degradation
- Context-aware deployment – Avoid AI in tasks prone to rejection or excessive rewrite
- Retroactive analysis – Use git history and API logs to measure past patterns (see the sketch after this list)
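As the retroactive-analysis item above suggests, git history alone can yield a crude rework proxy: weekly added versus deleted lines, i.e. churn. A minimal sketch; the repo path, the weekly bucketing, and churn-as-rework are illustrative assumptions, not the study's methodology.

```python
# Minimal sketch: approximate rework as weekly churn mined from git history.
# Repo path, weekly buckets, and churn-as-rework proxy are illustrative assumptions.
import subprocess
from collections import defaultdict
from datetime import datetime

def weekly_churn(repo_path: str) -> dict:
    """Sum lines added/deleted per ISO week from `git log --numstat`."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat", "--pretty=format:@%ad", "--date=short"],
        capture_output=True, text=True, check=True,
    ).stdout
    stats = defaultdict(lambda: [0, 0])  # week -> [added, deleted]
    week = None
    for line in log.splitlines():
        if line.startswith("@"):                       # commit header: "@YYYY-MM-DD"
            week = datetime.fromisoformat(line[1:]).strftime("%G-W%V")
        elif line.strip() and week:                    # numstat line: "added<TAB>deleted<TAB>path"
            added, deleted, _path = line.split("\t")
            if added.isdigit() and deleted.isdigit():  # skip binary files ("-")
                stats[week][0] += int(added)
                stats[week][1] += int(deleted)
    return stats

for week, (added, deleted) in sorted(weekly_churn(".").items()):
    ratio = deleted / added if added else 0.0
    print(f"{week}: +{added} -{deleted}  churn ratio {ratio:.2f}")
```

AI-tool API logs, where available, can be joined onto the same weekly buckets to compare churn before and after adoption.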
Business Impact & Strategy
- Prevent false positives – Avoid misreading activity spikes as productivity gains
- Target high-value contexts – Focus AI on tasks with proven ROI lift
- Address adoption disparity – Close gaps between low- and high-sophistication teams
- Accelerate course correction – Use retroactive measurement instead of long experiments
- Preserve trust – Maintain quality to sustain confidence in AI outputs
Key Insights
- AI productivity gains vary widely across teams
- Quantity of usage is a weak predictor; quality matters more
- Clean codebases amplify AI benefits; technical debt erodes them
- Misapplied AI increases rework and reduces trust
- Equal access doesn’t ensure equal adoption or sophistication
- Outcome-driven metrics with guardrails outperform activity metrics
Why This Matters
AI ROI depends on what you measure. Activity metrics—PR counts, commit frequency—make failure look like success. Quality metrics—rework rates, defect density, review cycles—show the real picture.
Most organizations measure activity. They celebrate velocity while quality collapses. The rework crisis arrives months later, when cause and effect have separated.
Codebase hygiene isn't optional. AI reflects your codebase back at you, amplified. Clean code gets cleaner. Messy code gets unmaintainable. Fix the foundation before deploying AI, or watch technical debt accelerate.
Actionable Playbook
- Assess and improve codebase hygiene: Boost tests, typing, and modularity; track cleanliness index
- Map AI usage patterns: Use telemetry to distinguish low-level from orchestrated workflows
- Implement ROI framework: Primary engineering output metric plus guardrails for quality, rework, and operational health (see the sketch after this playbook)
- Target AI deployment: Train teams to match tasks with AI's strengths
- Retroactively measure: Use git and API data to refine strategy
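The ROI-framework item above could look roughly like this: a single primary output metric whose gains only count when quality guardrails hold. Metric names, values, and thresholds are placeholders, not recommended targets.

```python
# Sketch of an ROI check: one primary output metric, gated by quality guardrails.
# Metric names, values, and thresholds are placeholder assumptions.
from dataclasses import dataclass

@dataclass
class Guardrail:
    name: str
    value: float
    threshold: float
    higher_is_worse: bool = True

    def breached(self) -> bool:
        return self.value > self.threshold if self.higher_is_worse else self.value < self.threshold

def ai_roi_verdict(output_delta_pct: float, guardrails: list) -> str:
    """Only claim a productivity gain when every guardrail is healthy."""
    breaches = [g.name for g in guardrails if g.breached()]
    if breaches:
        return f"no claim: guardrails breached ({', '.join(breaches)})"
    return f"credible gain: +{output_delta_pct:.1f}% effective output, guardrails healthy"

print(ai_roi_verdict(
    output_delta_pct=14.0,
    guardrails=[
        Guardrail("rework rate", value=0.15, threshold=0.08),
        Guardrail("defect density", value=1.2, threshold=1.0),
        Guardrail("change failure rate", value=0.11, threshold=0.15),
    ],
))  # prints "no claim: ..." because rework and defects exceed their thresholds
```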
What Works
Measure outcomes, not activity. Rework rates, defect density, review time. Not PR counts or commit frequency.
Assess codebase hygiene first. Tests, typing, modularity. AI amplifies whatever exists. Start with a mess, end with a bigger mess.
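One way to make "assess hygiene first" measurable: a simple weighted index over a few repo signals. The signals, weights, and the 300-line module target below are invented for illustration; a real assessment would pull from coverage tooling, type-checker reports, and dependency analysis.

```python
# Hypothetical codebase hygiene index: weighted blend of a few repo signals.
# Signals, weights, and the 300-line module target are illustrative assumptions.

def hygiene_index(test_coverage: float, typed_ratio: float, avg_module_loc: float, doc_ratio: float) -> float:
    """Return a 0-1 score; higher suggests a cleaner base for AI-assisted work."""
    size_score = min(1.0, 300.0 / max(avg_module_loc, 1.0))  # penalize oversized modules
    return round(0.4 * test_coverage + 0.3 * typed_ratio + 0.2 * size_score + 0.1 * doc_ratio, 2)

print(hygiene_index(test_coverage=0.72, typed_ratio=0.55, avg_module_loc=420, doc_ratio=0.30))  # 0.63
```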
Map usage patterns with telemetry. Who uses AI how? Sophistication variance determines outcome variance. Close the gap between best and worst practices.
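A minimal sketch of that mapping: bucket AI-tool telemetry events per developer into rough sophistication tiers. The event schema and the tier rules are assumptions for illustration, not a standard.

```python
# Sketch: classify a developer's AI usage into rough sophistication tiers.
# The event schema ("kind" field) and the tier rules are illustrative assumptions.
from collections import Counter

def sophistication_tier(events: list) -> str:
    """Tier a developer from hypothetical AI-tool telemetry events."""
    kinds = Counter(e["kind"] for e in events)
    if kinds["workflow_run"] and kinds["verification_step"]:
        return "orchestrated"        # multi-step workflows with verification gates
    if kinds["prompt"] >= 20:
        return "heavy ad hoc"        # lots of prompting, no structure
    return "light ad hoc"

events = [{"kind": "prompt"}] * 12 + [{"kind": "workflow_run"}, {"kind": "verification_step"}]
print(sophistication_tier(events))   # orchestrated
```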
Track quality guardrails continuously. Activity metrics spike immediately. Quality metrics lag. By the time rework shows up in data, you've shipped months of problematic code.
Train teams on when not to use AI. Misapplication in critical paths destroys trust. Selective application builds it.
This works when you have baseline measurement infrastructure. Without instrumentation, you're celebrating velocity while quality tanks silently. The cost shows up quarters later, too late to course-correct easily.
Full discussion: https://www.youtube.com/watch?v=JvosMkuNxF8.