Building and Evaluating Reliable AI Agents

Authors
  • Ptrck Brgr

AI agents are everywhere in 2025’s tech conversation. The vision: autonomous digital assistants that can execute complex tasks with minimal oversight. The reality: products that make headlines, raise huge rounds—and then crumble in the hands of real users.

Sayash Kapoor argues that the gap between promise and performance comes down to one factor: reliability. It’s not enough to show what an agent can do in the best case. You have to prove what it will do every time, under messy, real-world conditions. That shift—treating evaluation and reliability as core engineering disciplines—is what separates enduring products from public failures.

Main Story

Kapoor starts with a reality check: we already have basic AI agents in tools like ChatGPT and Claude, and in products like OpenAI Operator or Deep Research. They can chain actions, call APIs, and complete multi-step tasks. But the sci-fi dream of fully autonomous assistants remains out of reach.

The first roadblock is evaluation. Too often, bold claims collapse under scrutiny. DoNotPay was fined by the FTC for overstating performance. Legal research tools marketed as “hallucination free” still fabricated citations in up to a third of cases. Sakana AI’s “automated scientist” could only replicate 40% of simplified research tasks.

"Evaluating agents is genuinely a very hard problem."

Benchmarks built for single-turn, input/output LLM tasks don’t capture the interactive, environment-dependent nature of agents. They also ignore cost, which becomes critical at scale. Kapoor’s CoreBench leaderboard tracks both accuracy and cost, revealing cheaper models that match or outperform expensive ones.
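To make the cost-accuracy framing concrete, here is a minimal Python sketch of a Pareto filter over evaluation results; the model names and numbers are invented placeholders, not CoreBench data:

```python
# Hypothetical evaluation results: (model, accuracy, cost per run in USD).
runs = [
    ("model-a", 0.82, 3.10),
    ("model-b", 0.80, 0.45),
    ("model-c", 0.71, 0.40),
    ("model-d", 0.65, 1.20),
]

def pareto_frontier(results):
    """Keep entries not dominated by another entry with >= accuracy and <= cost."""
    frontier = []
    for name, acc, cost in results:
        dominated = any(
            other_acc >= acc and other_cost <= cost and (other_acc, other_cost) != (acc, cost)
            for _, other_acc, other_cost in results
        )
        if not dominated:
            frontier.append((name, acc, cost))
    return sorted(frontier, key=lambda r: r[2])

for name, acc, cost in pareto_frontier(runs):
    print(f"{name}: accuracy={acc:.0%}, cost=${cost:.2f}")
```

In this toy data, the mid-priced model is dominated by a cheaper, more accurate one and drops off the frontier, which is exactly the kind of comparison a cost-blind leaderboard hides.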

The danger is mistaking benchmark wins for production readiness. A single leaderboard victory can drive billion-dollar valuations, as with Cognition’s SWE-bench success, while real-world trials show far lower task completion rates. Human-in-the-loop evaluation, which adapts criteria in live contexts, gives a truer picture.

Kapoor draws a sharp line between capability and reliability. Capability is the model’s maximum potential; reliability is its consistent output. An 80% success rate may sound high, but for a personal assistant, it’s a dealbreaker. Closing the gap from “good enough” to “five nines” is the job of an AI engineer.
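A quick back-of-the-envelope sketch shows why the gap matters: assuming each run succeeds independently at 80% (an illustrative figure, not a number from the talk), the chance a user sees the agent work every single time drops fast as usage repeats:

```python
# How an 80% per-run success rate compounds over repeated, independent runs.
# Both the rate and the independence assumption are illustrative simplifications.
per_run_success = 0.80

for k in (1, 5, 10, 20):
    all_succeed = per_run_success ** k
    print(f"{k:>2} runs in a row succeed with probability {all_succeed:.1%}")
```

Twenty consecutive successes happen barely 1% of the time, which is why a capability that demos well can still feel broken as a daily assistant.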

Even verification systems can mislead. Automated tests can have false positives, skewing results. Reliability isn’t solved by more verifiers—it demands systems thinking, redundancy, and error handling to manage stochastic model behavior. The analogy: early computing’s vacuum tube failures, where years of engineering went into making machines dependable before they became broadly useful.
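As a hedged illustration of the verifier problem: if the automated verifier has a known false positive rate (estimated, say, against a hand-labeled sample), the observed pass rate can be corrected to an estimate of the true success rate. The rates below are hypothetical:

```python
def true_success_estimate(observed_pass_rate, verifier_tpr, verifier_fpr):
    """Back out the true success rate from an imperfect verifier's pass rate.

    observed = tpr * true + fpr * (1 - true)
      =>  true = (observed - fpr) / (tpr - fpr)
    Assumes verifier errors are independent of which runs actually succeed.
    """
    return (observed_pass_rate - verifier_fpr) / (verifier_tpr - verifier_fpr)

# Hypothetical: the verifier catches every real success (tpr=1.0) but also
# passes 40% of genuinely failed runs (fpr=0.4). A measured 90% pass rate
# then corresponds to a true success rate of roughly 83%.
estimate = true_success_estimate(observed_pass_rate=0.90, verifier_tpr=1.0, verifier_fpr=0.40)
print(f"estimated true success rate: {estimate:.1%}")
```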

Technical Considerations

For engineering leaders, Kapoor’s message translates into concrete practices:

  • Always log and compare both accuracy and cost in evaluations
  • Design benchmarks relevant to your domain and task structure
  • Use human-in-the-loop pilots to refine evaluation criteria in real use
  • Track per-task success rates over repeated runs, not just aggregate scores
  • Audit verifiers for false positives and test for edge cases
  • Build for stochasticity: include retries, fallbacks, and redundant checks (a sketch follows this list)
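None of these bullets prescribe a specific API, but as a minimal sketch of the last point, a retry wrapper with backoff and an optional fallback path can sit around any stochastic agent call. The run_agent and fallback_agent callables and the result’s ok attribute are assumptions for illustration:

```python
import time

def run_with_retries(run_agent, task, *, max_attempts=3, fallback_agent=None, backoff_s=1.0):
    """Run a stochastic agent call with retries, exponential backoff, and an optional fallback.

    `run_agent` and `fallback_agent` are hypothetical callables returning a
    result object with a boolean `ok` attribute; adapt to your own interfaces.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            result = run_agent(task)
            if result.ok:
                return result
            last_error = RuntimeError(f"verifier rejected attempt {attempt + 1}")
        except Exception as exc:  # network errors, tool failures, malformed output, etc.
            last_error = exc
        time.sleep(backoff_s * (2 ** attempt))  # back off before the next attempt
    if fallback_agent is not None:
        return fallback_agent(task)  # e.g. a cheaper, more constrained path
    raise last_error
```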

This is not just about model choice. It’s about orchestration, monitoring, and failure containment. Latency, throughput, and cost trade-offs are part of the same system design conversation. Vendor risk also grows when benchmarks don’t reflect your operational reality—switching providers midstream can be costly if your evaluation missed critical gaps.

Integrating these practices requires cross-functional skills: prompt engineering, systems architecture, evaluation design, and operational monitoring. Teams must be equipped to iterate quickly while keeping reliability as the primary KPI.

Business Impact & Strategy

From a leadership perspective, the stakes are high. Products that work only most of the time destroy user trust and burn market goodwill. Time-to-value shortens when evaluation catches reliability issues early, before public launch. Cost efficiency improves when benchmarks track both performance and spend, enabling Pareto optimization.

KPIs should include the following; a sketch for computing them from run logs appears after the list:

  • Task success rate over time in production
  • Cost per successful task
  • Mean time between failures in critical flows
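As a sketch of how these KPIs could be computed from production run logs (the log schema and numbers here are hypothetical):

```python
from datetime import datetime, timedelta

# Hypothetical production log: one record per agent run in a critical flow.
runs = [
    {"ts": datetime(2025, 6, 1, 9, 0),   "success": True,  "cost_usd": 0.42},
    {"ts": datetime(2025, 6, 1, 9, 30),  "success": False, "cost_usd": 0.55},
    {"ts": datetime(2025, 6, 1, 10, 0),  "success": True,  "cost_usd": 0.40},
    {"ts": datetime(2025, 6, 1, 11, 30), "success": False, "cost_usd": 0.61},
    {"ts": datetime(2025, 6, 1, 12, 0),  "success": True,  "cost_usd": 0.39},
]

successes = [r for r in runs if r["success"]]
success_rate = len(successes) / len(runs)
cost_per_success = sum(r["cost_usd"] for r in runs) / len(successes)

# Mean time between failures, from consecutive failure timestamps.
failure_ts = [r["ts"] for r in runs if not r["success"]]
gaps = [b - a for a, b in zip(failure_ts, failure_ts[1:])]
mtbf = sum(gaps, timedelta()) / len(gaps) if gaps else None

print(f"task success rate:        {success_rate:.0%}")
print(f"cost per successful task: ${cost_per_success:.2f}")
print(f"MTBF in this flow:        {mtbf}")
```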

Organizationally, this may mean creating dedicated reliability engineering roles for AI systems, much like site reliability engineers in cloud infrastructure. Risk mitigation includes piloting with limited user cohorts, building rollback mechanisms, and maintaining transparent failure reporting.

The cautionary tales are clear: ignoring reliability leads to the Humane Pin or Rabbit R1 outcome—technically impressive, commercially unsuccessful.

Key Insights

  • Agents exist today but remain far from the autonomous ideal
  • Evaluation is hard and often flawed; static benchmarks mislead
  • Cost must be integral to performance metrics
  • Benchmark success rarely matches real-world reliability
  • Capability ≠ reliability; the latter drives user trust
  • Verifiers can fail; reliability is a systems problem
  • Reliability engineering mindset is essential for AI success

Why It Matters

For technical leaders, this is a call to reframe the AI agent conversation. The question is no longer “What can it do?” but “What will it do every time, for our users, in our context?” For business leaders, it’s a reminder that hype-driven metrics can mask operational fragility.

An agent that fails 20% of the time is not “good enough” if that 20% undermines your core value proposition. Reliability is not a nice-to-have—it’s the foundation on which adoption, retention, and ROI are built.

Conclusion

Kapoor’s core message is deceptively simple: shift from chasing capabilities to engineering reliability. That mindset change will determine which AI products thrive and which fade after launch.

Watch the full talk for a deeper dive into methods, case studies, and practical frameworks: https://www.youtube.com/watch?v=d5EltXhbcfA