Building Truly Autonomous AI Coding Agents

Authors
  • Ptrck Brgr

Runtime length isn't autonomy. Decision scope is. Can the agent make every technical call without interrupting the user? That's the test. Passing it requires verification, context discipline, and orchestration, all inside the agent's loop, not the user's head.

Michele Catasta at Replit makes this argument in "Autonomy Is All You Need": non-technical creators need agents that handle technical complexity invisibly. Full talk: https://www.youtube.com/watch?v=MLhAA9yguwM

Orchestration quality—not model capability—decides which agents ship and which stall. Teams that master verification, context discipline, and core-loop control deliver usable outcomes. Those that chase model benchmarks without infrastructure investment hit walls when agents face real user workflows.

Two Modes of Autonomy

Supervised: humans handle edge cases. Agent does the predictable parts.

Unsupervised: agent makes every technical decision. No human in the loop for execution details.

The second mode matters. It opens software creation to anyone. No "driving license" required.

We should offload completely the level of complexity away from them. — Michele Catasta, Replit

The Runtime Trap

People confuse long runtimes with high autonomy. Wrong metric.

A narrow task can finish in minutes with full autonomy. A complex one can take hours. Task scope drives runtime, not autonomy level.

True autonomy: decision coverage across whatever complexity the task brings. Time is a side effect.
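
One way to make decision coverage measurable, as a minimal sketch: log every technical decision along with who made it, then compute the share the agent resolved alone. The `Decision` record, its fields, and the sample runs are hypothetical illustrations, not something taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """One technical choice made during a task (hypothetical record)."""
    description: str
    made_by_agent: bool  # False means the user had to be asked


def decision_coverage(decisions: list[Decision]) -> float:
    """Fraction of technical decisions the agent made without human input."""
    if not decisions:
        return 1.0  # assumption: no decisions needed means nothing was escalated
    agent_made = sum(1 for d in decisions if d.made_by_agent)
    return agent_made / len(decisions)


# A short run with full coverage vs. a longer run that keeps asking the user.
short_run = [Decision("pick database", True), Decision("choose auth flow", True)]
long_run = [
    Decision("pick database", True),
    Decision("choose auth flow", False),
    Decision("select deploy target", False),
    Decision("name API routes", True),
]
```

Under this metric the short run scores higher than the long one, which is the point: runtime never enters the calculation.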

Painted Doors: Why Verification Matters

Without autonomous testing, agents ship features that look done but break on first real use. "Painted doors"—they appear functional until someone tries them.

Browser-based testing catches these early. Playwright-style interaction verification. Real user flows, not unit test mocking.
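
A painted-door check might look like the sketch below: drive the flow the way a user would, then compare what is visible before and after. The `Page` protocol is a hypothetical interface whose method names follow Playwright's Python sync `Page` object (`goto`, `fill`, `click`, `inner_text`). The selectors, the URL, and the "add item" flow are made up for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


class Page(Protocol):
    """Minimal browser surface the check needs (hypothetical interface)."""

    def goto(self, url: str) -> None: ...
    def fill(self, selector: str, value: str) -> None: ...
    def click(self, selector: str) -> None: ...
    def inner_text(self, selector: str) -> str: ...


@dataclass
class FlowResult:
    passed: bool
    detail: str


def verify_add_item_flow(page: Page, base_url: str) -> FlowResult:
    """Exercise the flow end to end and check the visible outcome."""
    page.goto(base_url)
    before = page.inner_text("#items")  # hypothetical selector
    page.fill("#new-item", "buy milk")
    page.click("#add")
    after = page.inner_text("#items")
    # The button existing is not enough: the list must actually change.
    if after == before or "buy milk" not in after:
        return FlowResult(False, "clicking #add did not update the list")
    return FlowResult(True, "item appeared after click")
```

With Playwright installed, the page returned by `browser.new_page()` inside a `sync_playwright()` block would be the natural thing to pass in. Any object with the same four methods works, which also makes the check easy to exercise against a stub.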

Without testing agents build a lot of painted doors. — Michele Catasta, Replit

Automated verification prevents error cascades and builds regression suites that mean something. Testing belongs in the agent's core loop, not in an external process.
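
A rough sketch of what "testing in the core loop" could mean in code: each step is accepted only if its own check passes and every earlier check still passes, and accepted checks accumulate into the regression suite. All names here are hypothetical. The app is modeled as a flat dict and each step as a plain function, standing in for real generation and real browser checks.

```python
from typing import Callable

Check = Callable[[dict], bool]  # takes the app state, returns True if the flow works
Step = Callable[[dict], None]   # mutates the app state in place


def build_with_regression_suite(
    steps: list[tuple[Step, Check]],
    max_attempts: int = 3,
) -> tuple[dict, list[Check]]:
    """Apply each step, gated on its own check plus every earlier check."""
    app: dict = {}
    suite: list[Check] = []
    for apply_step, check in steps:
        for _ in range(max_attempts):
            candidate = dict(app)  # work on a copy so a bad attempt is discarded
            apply_step(candidate)
            if check(candidate) and all(old(candidate) for old in suite):
                app = candidate
                suite.append(check)  # the new check now guards later steps
                break
        else:
            # No attempt passed: stop instead of building on a broken base.
            raise RuntimeError("step failed verification after retries")
    return app, suite
```

In a real agent each retry would produce a different candidate. Here the steps are deterministic, so the retry loop only shows where regeneration would plug in.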

Context: Persist and Reload Selectively

Massive context windows aren't the answer. Persistence is.

Plans, docs, intermediate results—persist externally. Reload only what's relevant for current operation. Active context stays lean. Reasoning quality holds.
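
As an illustration of persist-then-reload, the sketch below writes each note to disk with tags and pulls back only the notes that match the current operation, within a character budget. `ContextStore`, the tag scheme, and the `max_chars` budget are assumptions for the example. A real agent might select by embedding similarity or file path instead.

```python
import json
from pathlib import Path


class ContextStore:
    """Keeps plans and notes on disk; loads only what the current step needs."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, name: str, text: str, tags: set[str]) -> None:
        record = {"name": name, "text": text, "tags": sorted(tags)}
        path = self.root / f"{name}.json"
        path.write_text(json.dumps(record), encoding="utf-8")

    def load_relevant(self, needed: set[str], max_chars: int = 2000) -> list[str]:
        """Return texts sharing a tag with `needed`, within the size budget."""
        picked: list[str] = []
        used = 0
        for path in sorted(self.root.glob("*.json")):
            record = json.loads(path.read_text(encoding="utf-8"))
            text = record["text"]
            if needed & set(record["tags"]) and used + len(text) <= max_chars:
                picked.append(text)
                used += len(text)
        return picked
```

Everything stays on disk, so nothing is lost. The budget only limits what enters the active window for a given step.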

Sub-agents handle specialized work in isolation. A testing agent runs, produces results, and discards its context. The main agent never sees the test execution logs, just pass/fail and the relevant errors.
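
The isolation boundary can be sketched as a function that keeps its working transcript private and returns only a summary. `run_testing_subagent` and `VerificationSummary` are hypothetical names. The checks are plain callables standing in for real test runs.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class VerificationSummary:
    """The only thing the main agent receives from the testing sub-agent."""
    passed: bool
    errors: list[str] = field(default_factory=list)


def run_testing_subagent(checks: dict[str, Callable[[], None]]) -> VerificationSummary:
    transcript: list[str] = []  # private working context, never returned
    errors: list[str] = []
    for name, check in checks.items():
        transcript.append(f"running {name}")
        try:
            check()
            transcript.append(f"{name}: ok")
        except AssertionError as exc:
            transcript.append(f"{name}: failed with {exc!r}")
            errors.append(f"{name}: {exc}")
    # The transcript is dropped when the function returns.
    # Only pass/fail and the relevant errors cross the boundary.
    return VerificationSummary(passed=not errors, errors=errors)
```

The main loop's context grows by one small summary per test run, regardless of how verbose the run itself was.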

Parallelism in the Core Loop

Agent decomposes tasks. Runs them in parallel. User cognitive load drops. Merge conflicts become the agent's job to resolve, not the user's.

Testing runs alongside generation. Main flow never blocks waiting for verification. Core-loop orchestration, not external coordination.
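
A minimal `asyncio` sketch of testing alongside generation: verification of one feature starts as a background task while the loop moves on to generating the next. `generate` and `verify` are hypothetical stand-ins for a model call and a browser test run.

```python
import asyncio


async def generate(feature: str) -> str:
    await asyncio.sleep(0)  # stand-in for a model call
    return f"code for {feature}"


async def verify(feature: str, code: str) -> tuple[str, bool]:
    await asyncio.sleep(0)  # stand-in for a browser test run
    return feature, code.startswith("code for")


async def build(features: list[str]) -> dict[str, bool]:
    pending: list[asyncio.Task] = []
    for feature in features:
        code = await generate(feature)
        # Schedule verification in the background and keep generating.
        pending.append(asyncio.create_task(verify(feature, code)))
    # Collect verdicts once generation is done; nothing blocked on them earlier.
    results = await asyncio.gather(*pending)
    return dict(results)
```

Calling `asyncio.run(build(["login", "search"]))` returns a pass/fail verdict per feature. A fuller version would feed failures back into generation rather than only reporting them.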

Technical Considerations

  • Measure autonomy by decision coverage, not runtime length
  • Integrate browser-based functional and integration testing to catch hidden failures
  • Use sub-agents for isolated tasks to keep main loop context clean
  • Persist plans and state externally for selective reload into active context
  • Design core loop to handle dynamic parallel task decomposition and merge control
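
The last point, merge control, can be approximated with a simple scheduling rule: only run tasks in parallel when they touch disjoint files. The sketch below uses greedy first-fit batching, which is an assumption of this example rather than a method described in the talk. The task names and file sets are hypothetical.

```python
def plan_batches(tasks: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into batches whose members touch disjoint files.

    Tasks inside a batch can run in parallel; batches run one after another.
    """
    batches: list[tuple[list[str], set[str]]] = []
    for name, files in tasks.items():
        for members, claimed in batches:
            if not (files & claimed):  # no shared file, so no conflict
                members.append(name)
                claimed |= files       # updates the batch's set in place
                break
        else:
            batches.append(([name], set(files)))
    return [members for members, _ in batches]


tasks = {
    "add login page": {"auth.py", "routes.py"},
    "add search": {"search.py"},
    "add logout": {"auth.py"},
}
```

Here `plan_batches(tasks)` puts the login and search tasks in the first batch and defers logout to a second one, because it shares `auth.py` with login. First-fit does not guarantee the fewest batches, but it keeps conflicting edits from running at the same time.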

Business Impact & Strategy

  • Expand user base by removing technical decision requirements
  • Reduce defect rates through autonomous verification before delivery
  • Shorten feedback cycles with in-agent parallel testing and generation
  • Lower cognitive burden on users, improving adoption and satisfaction
  • Minimize rework costs by avoiding merge conflicts via agent-managed orchestration

Key Insights

  • Autonomy is defined by decision scope, not execution time
  • Fully unsupervised agents unlock access for non-technical creators
  • Verification prevents "painted door" features from reaching users
  • Context persistence and selective reload improve coherence
  • Sub-agent orchestration isolates complexity and keeps reasoning focused
  • Parallelism inside the core loop speeds progress without user overhead

Why This Matters

Runtime benchmarks miss the point. Decision autonomy determines usability. An agent that runs for hours but needs human input every 10 minutes isn't autonomous. One that runs for 10 minutes making every decision itself is.

Verification inside the core loop separates toys from tools. Agents without autonomous testing ship broken features confidently. Users lose trust fast.

Non-technical creators become software creators when agents handle technical decisions invisibly. The barrier drops from "know how to code" to "know what you want." Market expansion, not just productivity gain.

Actionable Playbook

  • Redefine autonomy metrics: Track the percentage of technical decisions made by the agent without human input
  • Implement autonomous testing: Use browser-based programmatic tools to validate features end-to-end before release
  • Adopt sub-agent orchestration: Assign specialized tasks to isolated agents and discard their context after completion
  • Persist and reload context selectively: Store key plans and documentation externally and inject only when relevant
  • Enable core-loop parallelism: Let the agent dynamically split and run tasks in parallel to reduce latency

What Works

Measure autonomy by decision coverage, not runtime. Count technical decisions the agent makes without human input. That's the metric that matters.

Build verification into the core loop. Browser-based functional testing. Real user flows. Run it automatically before declaring features done.

Use sub-agents for isolation. Testing, analysis, exploration—keep specialized work in separate contexts. Pass results, discard process.

Persist context externally. Plans, docs, state—reload selectively. Active window stays lean, reasoning quality holds.

Enable core-loop parallelism. Decompose tasks, run them concurrently. Testing alongside generation. No blocking.

This works brilliantly for greenfield projects and well-defined tasks. Legacy systems with architectural debt? Harder. Agents don't untangle decades of complexity yet. The orchestration principles still apply, but results vary.

Full discussion: https://www.youtube.com/watch?v=MLhAA9yguwM.