Building AI Agents with Robotics Lessons

Author: Ptrck Brgr
The model is 1% of the work. The other 99%? Infrastructure, simulation, logging—the boring engineering that determines whether your agent works in production or just looks good in demos.
I spent years building environment perception and SLAM for autonomous vehicles before moving to enterprise AI agents. Same problems. Different domain. Teams are repeating mistakes autonomous driving made a decade ago—and the robotics community already documented every expensive lesson.
Agent developers don't need to relearn this from scratch. Copy the playbook. Infrastructure quality separates agents that ship from agents that spiral endlessly.
The Perception Trap
We nailed perception in autonomous driving years ago. Object detection? Solved. Lane tracking? Reliable. Semantic segmentation? Production-ready. The car sees everything it needs to see. Planning and execution in unpredictable environments? That's where systems still break.
Same pattern with AI agents. Reasoning capabilities are strong—models understand code, analyze requirements, generate solutions. Acting reliably in messy production systems? That's the bottleneck.
Perception looks like the hard problem when you start. It's not. Execution is. A self-driving car that perfectly detects a pedestrian but freezes when deciding whether to brake is useless. An agent that understands your entire codebase but can't deploy a config change without breaking dependencies is equally useless.
Closed-Loop Control: Why Agents Fail Silently
Robotics 101: measure, act, measure again. In SLAM systems, the vehicle constantly checks its position estimate against new sensor readings and corrects drift. Control loops run at 10Hz, 50Hz, sometimes faster—continuous verification and correction.
Most agents run open-loop. Execute a command, assume it worked, move on. No verification. No correction. The agent operates blind to its own failures.
I've seen this kill agent deployments repeatedly. One team added a simple verification step—check if the service actually responds after config changes. Failure rate dropped 70%. That's closed-loop control. Do something. Verify it worked. Correct if it didn't.
Open-loop agents cascade failures catastrophically. One bad action poisons everything downstream. The agent doesn't know it's operating on corrupted state. It keeps executing confidently, compounding damage with every step. Closed-loop catches the first failure before it spreads.
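The measure–act–measure loop above can be sketched in a few lines. This is a minimal illustration, not a production harness: `change` and `verify` are hypothetical callables you would supply (e.g. "apply the config" and "check the service responds"); the point is the loop structure, not the specific APIs.

```python
import time

def apply_with_verification(change, verify, max_retries=3, backoff_s=2.0):
    """Closed-loop action: act, measure, correct.

    `change` performs the action; `verify` returns True only if the
    system actually reflects it. Retries with backoff instead of
    assuming success and moving on.
    """
    for attempt in range(1, max_retries + 1):
        change()                          # act
        if verify():                      # measure again after acting
            return attempt                # which attempt succeeded
        time.sleep(backoff_s * attempt)   # back off before correcting
    raise RuntimeError("action did not take effect after verification")
```

The contrast with open-loop execution is exactly one line: an open-loop agent calls `change()` and returns, which is why it never notices it is operating on corrupted state.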
Time Discretization: Turn-Based vs Real-Time
Autonomous systems sample continuously. Perception at 10Hz. Planning at 5Hz. Control at 100Hz. Constant replanning as the environment changes. This handles dynamic scenarios—pedestrians stepping into crosswalks, vehicles cutting in, road conditions shifting.
Most agents run turn-based. Think, act, wait for response, think again. Simple. Clean. But it locks you into synchronous workflows where nothing happens in parallel.
Async events get missed completely. Database query timing out while the agent processes something else? Invisible. Service degrading gradually in the background? No signal reaches the agent until it explicitly checks.
This design choice fundamentally constrains which problems agents can solve. Real-time system monitoring where you need to react to anomalies as they happen? Hard with turn-based agents. Batch ETL jobs with clear sequential steps? Perfect fit.
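Continuous sampling can be sketched with an async polling loop. Everything here is illustrative: `read_signal` stands in for whatever metric you monitor, `on_event` for your anomaly handler, and the `1.0` threshold is arbitrary. A turn-based agent would only see these values when it explicitly checks; this loop surfaces them as they happen.

```python
import asyncio

async def sample_loop(read_signal, on_event, hz=10, ticks=5):
    """Poll a signal at a fixed rate and raise anomalies as events.

    `hz` is the sampling frequency; `ticks` bounds the loop so the
    sketch terminates. Real monitors would run indefinitely alongside
    the agent's main task.
    """
    period = 1.0 / hz
    for _ in range(ticks):
        value = read_signal()
        if value > 1.0:          # illustrative anomaly threshold
            on_event(value)      # reaches the agent without a "turn"
        await asyncio.sleep(period)
```

Because the loop is a coroutine, it can run concurrently with the agent's think–act cycle under the same event loop, which is what makes async events visible at all.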
Action Granularity Determines Capability
In robotics, action granularity defines capability limits. Steering angle commands every 10ms give you lane-keeping. Torque commands every millisecond let you handle emergency maneuvers. The finer the control, the more complex scenarios you can handle—and the harder the planning problem becomes.
Agents face the same tradeoff. Most work at API-call granularity. Coarse. Reliable. Limited. Want finer control over systems that don't expose clean APIs? Drop to character-level terminal IO, frame-based GUI manipulation, byte-level file editing.
More granularity means more flexibility. Also exponentially more planning complexity. Deciding which 5 API calls to make is tractable. Deciding which 500 terminal keystrokes to send? Planning search space explodes.
The choice depends entirely on your problem domain. High-level workflow orchestration? Coarse API granularity works fine. Low-level system control where no APIs exist? You need fine-grained actions and the infrastructure to handle that complexity.
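The search-space explosion is easy to make concrete. With a naive planner, the number of candidate plans grows as branching factor to the power of plan depth, so the two regimes in the text differ by hundreds of orders of magnitude. The numbers below (10 API endpoints, ~100 keys) are illustrative assumptions, not measurements.

```python
def plan_space(branching: int, depth: int) -> int:
    """Naive planning search space: branching ** depth candidate plans."""
    return branching ** depth

# Coarse: choose 5 calls from ~10 API endpoints -- tractable.
coarse = plan_space(10, 5)       # 100,000 candidate plans
# Fine: choose 500 keystrokes from ~100 keys -- astronomically large.
fine = plan_space(100, 500)      # a 1,001-digit number
```

This is why dropping to fine-grained actions is a conscious tradeoff: the flexibility is real, but so is the planning cost.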
The Statefulness Problem
Stateless agents reset every interaction. Clean slate. No memory. Great for isolated tasks like answering questions or generating code snippets. Terrible for ongoing operational work where context matters—deployments spanning hours, multi-step migrations, system monitoring over days.
Stateful agents maintain context across sessions. What systems are running. What data exists. What changed since last check. This matches how autonomous vehicles work—the SLAM system doesn't reset its map every second, it updates continuously based on new observations.
Simulation complexity explodes with statefulness. Testing a stateless agent? Spin up clean environment, run test, tear down. Simple. Repeatable. Testing a stateful agent? You need realistic starting conditions that match production diversity.
Production databases with real data distributions, not empty test schemas. Services in various states—some healthy, some degraded, some failing intermittently. Ongoing processes mid-execution. Partial state from previous agent runs.
Most agent simulations ignore this completely. They test against fresh environments every time. Then production surprises them with state they never simulated. The agent encounters a scenario it's never seen—half-completed migration, stale cache, inconsistent replica—and fails in ways simulation never caught.
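One way to avoid always testing against fresh environments is to sample starting states that include the mess the text describes: half-completed migrations, degraded services, stale caches. The state shape and probabilities below are hypothetical; the technique is seeding each simulation run from a distribution of realistic conditions rather than a clean slate.

```python
import random
from dataclasses import dataclass

SERVICE_STATES = ["healthy", "degraded", "failing_intermittently"]

@dataclass
class SimState:
    services: dict              # per-service health at run start
    migration_progress: float   # 0.0 = not started, 1.0 = complete
    stale_cache: bool

def sample_start_state(rng: random.Random) -> SimState:
    """Sample a messy-but-plausible starting state for one sim run."""
    return SimState(
        services={name: rng.choice(SERVICE_STATES)
                  for name in ("db", "api", "worker")},
        # Deliberately include half-done migrations, not just 0.0/1.0.
        migration_progress=rng.choice([0.0, 0.4, 1.0]),
        stale_cache=rng.random() < 0.3,
    )
```

Passing an explicit `random.Random` seed makes each sampled scenario reproducible, so a failure found in simulation can be replayed exactly.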
Technical Considerations
- Closed-loop control for agent actions to enable mid-process corrections
- Realistic simulation environments with persistent state for accurate testing
- Expanded action granularity to increase flexibility in execution
- Detailed logging and categorization of failures for targeted iteration
- Tooling and infrastructure that enable rapid offline experimentation
Business Impact & Strategy
- Faster iteration cycles through robust offline tooling
- Lower failure rates by integrating feedback loops into execution
- Reduced deployment risk via realistic simulation and logging
- Broader capability scope from expanded action spaces
- Improved production readiness with targeted failure analysis
Key Insights
- Infrastructure and tooling often outweigh model performance in real-world agents
- Closed-loop feedback reduces cascading failures and improves resilience
- Time discretization shapes responsiveness and problem scope
- Broader action spaces increase flexibility but add complexity
- Stateful agents require richer simulations and context handling
- Shifting from models that reason to agents that act introduces new failure modes needing recovery strategies
Why This Matters
Stop benchmarking models. Start auditing infrastructure. I've seen this pattern repeatedly: teams with weaker models and solid simulation infrastructure ship reliable agents. Teams with frontier models and no logging spiral for months debugging production failures they can't reproduce.
Model providers compete on reasoning benchmarks. Your differentiation isn't which model you picked—it's operational readiness. How fast you iterate when agents fail in production. How quickly you recover from errors. Whether you catch problems in simulation before they reach customers.
The economics are completely backwards from what most teams assume. They spend months evaluating models, comparing benchmarks, running bake-offs. Then they spend two days building simulation and logging infrastructure before deploying to production.
Should be the opposite. Model selection takes an afternoon—pick the frontier model that fits your budget and latency constraints. Building reliable simulation, logging, and feedback systems? That takes months. Testing against realistic production state? Even longer. That's where the actual work lives.
This mirrors what we learned in autonomous driving. The perception model matters. But the simulation infrastructure that lets you test against millions of scenarios before deploying to real vehicles? That's what separates systems that ship safely from systems that cause accidents.
Actionable Playbook
- Audit full agent stack: Map tools, APIs, simulation, retraining, and monitoring; identify offline tooling gaps that slow iteration
- Implement closed-loop control: Add feedback mechanisms for critical actions; track reduction in failure cascades
- Expand action granularity: Pilot finer control interfaces; measure capability gains in complex tasks
- Simulate realistic starting states: Include persistent data and ongoing processes; validate against production conditions
- Use detailed failure logging: Categorize failures; prioritize fixes that improve simulation fidelity
What Works
We spent decades learning these lessons in autonomous systems. Agent developers don't need to repeat that expensive education.
Add closed-loop verification everywhere. After every critical agent action, verify it worked. Check that the service responds. Confirm the config took effect. Validate the deployment succeeded. Sounds obvious. Most agents skip this entirely—they execute and hope.
Build simulation that matches production reality. Not toy examples with clean data. Real production data distributions. Real failure modes—timeouts, partial failures, degraded services. Real ongoing processes and persistent state. Your agent will find every gap between simulation and production. Better to find them in testing where failures are cheap.
Log everything with structured categorization. Raw logs are noise that nobody reads. Categorize failures by type—wrong output, timeout, dependency unavailable, permission denied, state corruption. Patterns emerge. You fix them systematically instead of playing whack-a-mole with symptoms.
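A minimal sketch of the categorization step, using the failure types named above. The keyword matching is a deliberate simplification (real systems tag failures at the source); what matters is that every failure lands in a bucket, so counts can drive prioritization.

```python
from collections import Counter
from enum import Enum

class FailureKind(Enum):
    WRONG_OUTPUT = "wrong_output"
    TIMEOUT = "timeout"
    DEPENDENCY_UNAVAILABLE = "dependency_unavailable"
    PERMISSION_DENIED = "permission_denied"
    STATE_CORRUPTION = "state_corruption"

def categorize(error_message: str) -> FailureKind:
    """Naive keyword-based bucketing -- illustrative only."""
    msg = error_message.lower()
    if "timeout" in msg or "timed out" in msg:
        return FailureKind.TIMEOUT
    if "permission" in msg or "denied" in msg:
        return FailureKind.PERMISSION_DENIED
    if "unavailable" in msg or "connection refused" in msg:
        return FailureKind.DEPENDENCY_UNAVAILABLE
    if "corrupt" in msg:
        return FailureKind.STATE_CORRUPTION
    return FailureKind.WRONG_OUTPUT   # default bucket

def failure_histogram(messages):
    """Aggregate categorized failures so patterns emerge from counts."""
    return Counter(categorize(m) for m in messages)
```

Sorting the histogram by count is the systematic alternative to whack-a-mole: the biggest bucket is the next fix.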
Match sampling frequency to your problem domain. Turn-based execution works fine for batch processing and sequential workflows. Real-time problems—system monitoring, anomaly detection, dynamic resource allocation—need continuous sampling and async event handling. Design your agent's time discretization to match what the domain actually requires.
Expand action granularity only when coarse control fails. API-level control is simpler and more reliable. Drop to finer granularity—terminal IO, GUI manipulation, byte-level file operations—only when you hit tasks that APIs fundamentally can't handle. More control means exponentially more complexity. Take that tradeoff consciously.
The hard part isn't building agents that work in demos with clean inputs. It's building agents that keep working when production throws chaos—partial failures, inconsistent state, race conditions, cascading errors. Autonomous systems figured this out. Copy the infrastructure playbook, skip the decades of expensive failures.