Building Robust Agentic AI at Enterprise Scale

By Ptrck Brgr

Box’s journey into generative AI began with promising results from off‑the‑shelf LLMs. These models handled simple metadata extraction and document Q&A better than past ML systems — without custom training. But enterprise reality soon tested their limits.

Ben Kus, Box’s CTO, explains how early successes collided with the demands of 300‑page contracts, multilingual content, OCR imperfections, and high‑stakes accuracy. The answer wasn’t to wait for better models, but to change the architecture.

Main Story

Box serves over 115,000 enterprise customers and manages more than an exabyte of unstructured content. Its secure environment made it a natural first home for many customers’ AI experiments. In 2023, Box rolled out generative AI features for search, workflows, and structured data extraction from unstructured sources.

Initial LLM deployments impressed. Generic models beat traditional Intelligent Document Processing tools on flexibility and speed. But scaling to real-world complexity exposed failures: long documents, hundreds of fields, varied formats, and language diversity caused models to lose track of context, misinterpret instructions, and deliver inconsistent results.

“We thought LLMs [would solve] everything… until… this just starts to not work.”

Kus and his team identified the problem as architectural, not just a matter of model quality. They pivoted to an agentic framework: AI agents with defined objectives, tool access, memory, and orchestration via directed graphs. This allowed (see the sketch after this list):

  • Breaking tasks into smaller, related groups of fields
  • Multi‑pass extraction and iterative refinement
  • Cross‑model voting for higher confidence
  • Self‑critique loops that trigger retries before output
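
To make the pattern concrete, here is a minimal sketch of decomposition plus a self-critique retry loop. The `llm_complete` helper and the prompts are illustrative assumptions, not Box's actual API:

```python
from typing import Callable

def llm_complete(prompt: str) -> str:
    """Placeholder for a real chat-completion call (any vendor's client)."""
    raise NotImplementedError

def extract_field_group(document: str, fields: list[str],
                        complete: Callable[[str], str] = llm_complete,
                        max_retries: int = 2) -> str:
    """Extract one small, related group of fields, retrying on self-critique."""
    prompt = f"Extract the fields {fields} from the document below.\n{document}"
    answer = ""
    for _ in range(max_retries + 1):
        answer = complete(prompt)
        # Self-critique loop: the model reviews its own output before we accept it.
        critique = complete(
            f"Fields: {fields}\nProposed answer: {answer}\n"
            "Reply OK if complete and consistent; otherwise describe the problem."
        )
        if critique.strip().upper().startswith("OK"):
            return answer
        # Feed the critique back in and try again.
        prompt += f"\nPrevious attempt: {answer}\nReviewer feedback: {critique}\nTry again."
    return answer  # best effort after exhausting retries

def extract_document(document: str, field_groups: list[list[str]]) -> dict[str, str]:
    """Break a large extraction into smaller groups, one agent pass per group."""
    return {", ".join(group): extract_field_group(document, group)
            for group in field_groups}
```

In production each group could run in parallel, and the critique step is where the retries Box describes get triggered.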

The same framework powered new capabilities like “deep research” — orchestrating search, relevance checks, outlining, and synthesis into coherent answers.
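
Viewed this way, deep research is just another directed pipeline of agent steps. A minimal sketch, with the stage functions left as hypothetical placeholders:

```python
from typing import Callable, Sequence

def deep_research(
    question: str,
    search: Callable[[str], Sequence[str]],
    is_relevant: Callable[[str, str], bool],
    outline: Callable[[str, Sequence[str]], str],
    synthesize: Callable[[str, str, Sequence[str]], str],
) -> str:
    """Chain the stages described above: search, relevance check, outline, synthesis."""
    hits = search(question)                                  # gather candidate sources
    sources = [h for h in hits if is_relevant(question, h)]  # drop irrelevant hits
    plan = outline(question, sources)                        # structure the answer
    return synthesize(question, plan, sources)               # write the final response
```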

A key design decision was to separate the agentic workflow layer from the distributed processing infrastructure. This made it possible to evolve orchestration logic independently of the systems that handle hundreds of millions of documents.
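
One way to picture that separation, as a sketch rather than Box's actual interfaces: the workflow layer programs against a narrow contract, and the scaling layer implements it.

```python
from typing import Iterable, Protocol

class ProcessingBackend(Protocol):
    """Contract the distributed processing (scaling) layer must fulfil."""
    def fetch_document(self, doc_id: str) -> str: ...
    def submit_batch(self, doc_ids: Iterable[str], task: str) -> None: ...

class AgenticWorkflow:
    """Orchestration logic written against the interface, not a concrete backend."""
    def __init__(self, backend: ProcessingBackend) -> None:
        self.backend = backend

    def run(self, doc_id: str) -> str:
        document = self.backend.fetch_document(doc_id)
        # ... agent steps (decompose, extract, critique) operate on `document` ...
        return document
```

Either side can now change independently: a new orchestration graph needs no infrastructure work, and a backend migration never touches agent logic.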

Technical Considerations

For engineering leaders, Box’s experience highlights several constraints and trade‑offs:

  • Task complexity: Large, multi‑field, multi‑language documents exceed the practical reasoning span of current models
  • Orchestration: Directed graphs of agents allow explicit dependencies and modular upgrades without re‑architecting the whole system
  • Model diversity: Running critical steps across multiple vendors reduces single-point-of-failure risk and improves accuracy
  • Self‑assessment: Embedding AI‑driven review loops catches errors earlier but adds compute cost and latency
  • Separation of concerns: Decoupling the intelligence layer from the scaling layer enables independent iteration and performance tuning
  • Security posture: Maintaining enterprise‑grade privacy and compliance is essential, especially when integrating multiple model providers
  • Vendor flexibility: Avoiding fine-tuning lock-in reduces retraining costs and eases model swapping as the landscape shifts
  • Integration paths: Exposing agentic capabilities via APIs allows internal teams and customers to build on the same foundation

These choices balance accuracy, speed, and cost. Leaders must weigh latency from multi‑pass and multi‑model approaches against the value of higher confidence in results.
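
To make that trade-off concrete, cross-model voting can be sketched as below; every additional vendor call raises confidence but also latency and cost. The model-client callables are hypothetical stand-ins:

```python
from collections import Counter
from typing import Callable, Sequence

def normalize(answer: str) -> str:
    """Canonicalize answers so formatting differences do not split the vote."""
    return answer.strip().lower()

def vote(prompt: str, models: Sequence[Callable[[str], str]]) -> tuple[str, float]:
    """Run the same prompt through several vendors' models; return the majority
    answer and the agreement ratio as a rough confidence signal."""
    answers = [normalize(model(prompt)) for model in models]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / len(answers)
```

A low agreement ratio is a natural trigger for the retry and self-critique paths described earlier, reserving the expensive loops for the cases that need them.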

Business Impact & Strategy

Box’s agentic pivot changed outcomes on multiple fronts:

  • Time‑to‑value: Modular workflows let teams add checks or swap models quickly without full redesigns
  • Cost control: Vendor‑agnostic orchestration avoids deep coupling with any one model provider
  • KPIs: Accuracy in complex extraction tasks improved through decomposition, voting, and retries
  • Org design: Training teams in “agentic first” thinking enabled faster internal innovation and richer customer‑facing APIs
  • Risk management: Separation of intelligence and scale layers reduced the blast radius of failures and simplified scaling decisions

For leaders, the takeaway is strategic flexibility. An agentic architecture supports evolving use cases, vendor changes, and new AI capabilities without locking into brittle designs.

Key Insights

  • Early LLM success can mask fragility in complex, high‑accuracy tasks
  • Architectural shifts can matter more than waiting for better models
  • Agentic frameworks enable decomposition, orchestration, and self‑critique
  • Separating logic from scale infrastructure preserves agility
  • Multi‑model strategies improve resilience and accuracy

Why It Matters

For technical teams, agentic design is a hedge against the unpredictability of model progress. It creates room to adapt workflows without tearing down the system. For business leaders, it safeguards investments and shortens the path from idea to production‑grade capability.

In enterprise AI, complexity arrives sooner than expected. Building with agents from the start can turn that complexity into a manageable, evolvable asset.

Conclusion

Box’s move from simple LLM use to a robust agentic platform shows how to meet enterprise AI’s dual demands: flexibility and accuracy at scale. The lesson is clear — if AI might help, design agentically from day one.

Watch the full conversation with Ben Kus here: https://www.youtube.com/watch?v=12v5S1n1eOY