Building Robust Agentic AI at Enterprise Scale

By Ptrck Brgr

Box’s journey into generative AI began with promising results from off‑the‑shelf LLMs. These models handled simple metadata extraction and document Q&A better than past ML systems — without custom training. But enterprise reality soon tested their limits.

Ben Kus, Box’s CTO, explains how early successes collided with the demands of 300‑page contracts, multilingual content, OCR imperfections, and high‑stakes accuracy. The answer wasn’t to wait for better models, but to change the architecture.

Main Story

Box serves over 115,000 enterprise customers and manages more than an exabyte of unstructured content. Its secure environment made it a natural first home for many customers’ AI experiments. In 2023, Box rolled out generative AI features for search, workflows, and structured data extraction from unstructured sources.

Initial LLM deployments impressed. Generic models beat traditional Intelligent Document Processing tools on flexibility and speed. But scaling to real-world complexity exposed failures: long documents, hundreds of fields, varied formats, and language diversity caused models to lose track of context, misinterpret instructions, and deliver inconsistent results.

“We thought LLMs [would solve] everything… until… this just starts to not work.”

Kus and his team identified the problem as architectural, not just a matter of model quality. They pivoted to an agentic framework: AI agents with defined objectives, tool access, memory, and orchestration via directed graphs. This allowed (see the sketch after this list):

  • Breaking tasks into smaller, related groups of fields
  • Multi‑pass extraction and iterative refinement
  • Cross‑model voting for higher confidence
  • Self‑critique loops that trigger retries before output
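
To make the pattern concrete, here is a minimal sketch of decomposition plus a self-critique retry loop. The `llm_complete` helper and the prompts are illustrative assumptions, not Box's actual API:

```python
from typing import Callable

def llm_complete(prompt: str) -> str:
    """Placeholder for a real chat-completion call (any vendor's client)."""
    raise NotImplementedError

def extract_field_group(document: str, fields: list[str],
                        complete: Callable[[str], str] = llm_complete,
                        max_retries: int = 2) -> str:
    """Extract one small, related group of fields, retrying on self-critique."""
    prompt = f"Extract the fields {fields} from the document below.\n{document}"
    answer = ""
    for _ in range(max_retries + 1):
        answer = complete(prompt)
        # Self-critique loop: the model reviews its own output before we accept it.
        critique = complete(
            f"Fields: {fields}\nProposed answer: {answer}\n"
            "Reply OK if complete and consistent; otherwise describe the problem."
        )
        if critique.strip().upper().startswith("OK"):
            return answer
        # Feed the critique back in and try again.
        prompt += f"\nPrevious attempt: {answer}\nReviewer feedback: {critique}\nTry again."
    return answer  # best effort after exhausting retries

def extract_document(document: str, field_groups: list[list[str]]) -> dict[str, str]:
    """Break a large extraction into smaller groups, one agent pass per group."""
    return {", ".join(group): extract_field_group(document, group)
            for group in field_groups}
```

In production each group could run in parallel, and the critique step is where the retries Box describes get triggered.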

The same framework powered new capabilities like “deep research” — orchestrating search, relevance checks, outlining, and synthesis into coherent answers.
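
Viewed this way, deep research is just another directed pipeline of agent steps. A minimal sketch, with the stage functions left as hypothetical placeholders:

```python
from typing import Callable, Sequence

def deep_research(
    question: str,
    search: Callable[[str], Sequence[str]],
    is_relevant: Callable[[str, str], bool],
    outline: Callable[[str, Sequence[str]], str],
    synthesize: Callable[[str, str, Sequence[str]], str],
) -> str:
    """Chain the stages described above: search, relevance check, outline, synthesis."""
    hits = search(question)                                  # gather candidate sources
    sources = [h for h in hits if is_relevant(question, h)]  # drop irrelevant hits
    plan = outline(question, sources)                        # structure the answer
    return synthesize(question, plan, sources)               # write the final response
```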

A key design decision was to separate the agentic workflow layer from the distributed processing infrastructure. This made it possible to evolve orchestration logic independently of the systems that handle hundreds of millions of documents.
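
One way to picture that separation, as a sketch rather than Box's actual interfaces: the workflow layer programs against a narrow contract, and the scaling layer implements it.

```python
from typing import Iterable, Protocol

class ProcessingBackend(Protocol):
    """Contract the distributed processing (scaling) layer must fulfil."""
    def fetch_document(self, doc_id: str) -> str: ...
    def submit_batch(self, doc_ids: Iterable[str], task: str) -> None: ...

class AgenticWorkflow:
    """Orchestration logic written against the interface, not a concrete backend."""
    def __init__(self, backend: ProcessingBackend) -> None:
        self.backend = backend

    def run(self, doc_id: str) -> str:
        document = self.backend.fetch_document(doc_id)
        # ... agent steps (decompose, extract, critique) operate on `document` ...
        return document
```

Either side can now change independently: a new orchestration graph needs no infrastructure work, and a backend migration never touches agent logic.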

Technical Considerations

For engineering leaders, Box’s experience highlights several constraints and trade‑offs:

  • Task complexity: Large, multi‑field, multi‑language documents exceed the practical reasoning span of current models
  • Orchestration: Directed graphs of agents allow explicit dependencies and modular upgrades without re‑architecting the whole system
  • Model diversity: Running critical steps across multiple vendors reduces single-point-of-failure risk and improves accuracy
  • Self‑assessment: Embedding AI‑driven review loops catches errors earlier but adds compute cost and latency
  • Separation of concerns: Decoupling the intelligence layer from the scaling layer enables independent iteration and performance tuning
  • Security posture: Maintaining enterprise‑grade privacy and compliance is essential, especially when integrating multiple model providers
  • Vendor flexibility: Avoiding fine-tuning lock-in reduces retraining costs and eases model swapping as the landscape shifts
  • Integration paths: Exposing agentic capabilities via APIs allows internal teams and customers to build on the same foundation

These choices balance accuracy, speed, and cost. Leaders must weigh latency from multi‑pass and multi‑model approaches against the value of higher confidence in results.
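
To make that trade-off concrete, cross-model voting can be sketched as below; every additional vendor call raises confidence but also latency and cost. The model-client callables are hypothetical stand-ins:

```python
from collections import Counter
from typing import Callable, Sequence

def normalize(answer: str) -> str:
    """Canonicalize answers so formatting differences do not split the vote."""
    return answer.strip().lower()

def vote(prompt: str, models: Sequence[Callable[[str], str]]) -> tuple[str, float]:
    """Run the same prompt through several vendors' models; return the majority
    answer and the agreement ratio as a rough confidence signal."""
    answers = [normalize(model(prompt)) for model in models]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / len(answers)
```

A low agreement ratio is a natural trigger for the retry and self-critique paths described earlier, reserving the expensive loops for the cases that need them.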

Business Impact & Strategy

Box’s agentic pivot changed outcomes on multiple fronts:

  • Time‑to‑value: Modular workflows let teams add checks or swap models quickly without full redesigns
  • Cost control: Vendor‑agnostic orchestration avoids deep coupling with any one model provider
  • KPIs: Accuracy in complex extraction tasks improved through decomposition, voting, and retries
  • Org design: Training teams in “agentic first” thinking enabled faster internal innovation and richer customer‑facing APIs
  • Risk management: Separation of intelligence and scale layers reduced the blast radius of failures and simplified scaling decisions

For leaders, the takeaway is strategic flexibility. An agentic architecture supports evolving use cases, vendor changes, and new AI capabilities without locking into brittle designs.

Key Insights

  • Early LLM success can mask fragility in complex, high‑accuracy tasks
  • Architectural shifts can matter more than waiting for better models
  • Agentic frameworks enable decomposition, orchestration, and self‑critique
  • Separating logic from scale infrastructure preserves agility
  • Multi‑model strategies improve resilience and accuracy

Why It Matters

For technical teams, agentic design is a hedge against the unpredictability of model progress. It creates room to adapt workflows without tearing down the system. For business leaders, it safeguards investments and shortens the path from idea to production‑grade capability.

In enterprise AI, complexity arrives sooner than expected. Building with agents from the start can turn that complexity into a manageable, evolvable asset.

Conclusion

Box’s move from simple LLM use to a robust agentic platform shows how to meet enterprise AI’s dual demands: flexibility and accuracy at scale. The lesson is clear — if AI might help, design agentically from day one.

Watch the full conversation with Ben Kus here: https://www.youtube.com/watch?v=12v5S1n1eOY