How Amazon Kiro Turns Prompts Into Verifiable Specs
- Author: Ptrck Brgr
"Done" is not a verification strategy. Every developer who's used an AI coding agent knows the pattern: the LLM generates code, declares victory, moves on. Whether the code actually does what you asked? That's a feeling, not a fact.
In his talk "Spec-Driven Development: Agentic Coding at FAANG Scale and Quality," Amazon's Al Harris argues the fix isn't better models. It's structured requirements that survive the chat window and connect to falsifiable correctness checks.
It connects to something I keep circling back to. Speed without guardrails is technical debt at 10x velocity. At Tier, we shipped edge ML fast—constraints bred creativity—but only because we had hard invariants to satisfy. Remove those, and you're just generating plausible code. Harris is building the guardrails into the IDE itself.
The Operator Bottleneck
We took a look at some existing stuff out there and said, "Hey, vibe coding is great, but vibe coding relies a lot on me as the operator getting things right." — Al Harris, Amazon
Vibe coding makes the human the quality gate. You write the prompt, check the output, decide "good enough." That works for prototypes. It collapses for anything with real requirements, and I've watched this play out: no spec, no acceptance criteria, just a prompt and a prayer.
Kiro's answer: a structured workflow that front-loads thinking. You give it a prompt. It generates requirements with acceptance criteria in EARS (Easy Approach to Requirements Syntax), structured natural language designed to be machine-parsable. Not a plan the LLM forgets three turns later. A living artifact.
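For a sense of what that structure buys you, here's an illustrative EARS-style requirement (my example, not Kiro's output): "When the user submits a payment with an expired card, the checkout service shall reject the transaction and return a card-expired error." Trigger, system, response, each in a fixed slot that a parser or a test generator can latch onto.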
EARS to Invariants: The Pipeline Nobody Built
What I find genuinely fascinating is the next step. EARS requirements translate directly into correctness properties—invariants your system must satisfy.
Now your EARS requirements can be translated directly into properties of the system which are effectively invariants that you want to deliver. — Al Harris, Amazon
If you've used Hypothesis in Python or fast-check in Node, you know property-based testing. Define properties the system must satisfy, and the framework generates inputs that try to falsify them. No counterexample? Confidence the system works. Harris connects structured natural-language requirements to this paradigm automatically.
That pipeline—prompt to EARS to properties to property-based tests—is what I haven't seen anyone else build. Most AI coding tools stop at "generate code and unit tests." I keep coming back to this: why did it take this long to connect requirements engineering to AI code generation?
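To make those hops concrete, here's a minimal sketch of the last one: an EARS-style requirement hand-translated into a Hypothesis property. The requirement, the apply_discount function, and its bounds are mine for illustration; in Kiro these artifacts are generated, this is just the shape they take.

```python
# Requirement (EARS, event-driven, hypothetical example):
# "When a discount is applied to an order, the checkout service shall
#  return a total that is never negative and never exceeds the original total."

from hypothesis import given, strategies as st


def apply_discount(total: float, discount_pct: float) -> float:
    """Toy implementation under test; stands in for generated code."""
    discount_pct = max(0.0, min(discount_pct, 100.0))  # clamp to a valid range
    return round(total * (1 - discount_pct / 100), 2)


# Property derived from the requirement: an invariant over the whole input
# space, not a handful of hand-picked cases.
@given(
    total=st.floats(min_value=0, max_value=1e6, allow_nan=False),
    discount_pct=st.floats(min_value=-50, max_value=150, allow_nan=False),
)
def test_discounted_total_stays_within_bounds(total, discount_pct):
    result = apply_discount(total, discount_pct)
    assert 0 <= result <= total + 0.01  # small tolerance for rounding
```

The point is the traceability: the assertion is a direct restatement of the requirement's "shall" clause, not a hand-picked example.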
The Confidence Problem
But here's the catch. Harris is honest about the limits.
He describes property-based testing as giving you a "high degree of confidence"—then adds that "high" is doing "a little bit of heavy lifting." Write weak properties, get weak guarantees. Same garbage-in-garbage-out, just shifted upstream.
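A toy contrast, using a stand-in config codec (the functions are hypothetical, not from the talk):

```python
import json
from hypothesis import given, strategies as st

# Toy codec standing in for real parsing code.
def serialize_config(cfg: dict) -> str:
    return json.dumps(cfg)

def parse_config(raw: str) -> dict:
    return json.loads(raw)

# Weak property: almost any implementation passes, including badly broken ones.
@given(st.dictionaries(st.text(min_size=1), st.integers()))
def test_parse_returns_a_dict(cfg):
    assert isinstance(parse_config(serialize_config(cfg)), dict)

# Stronger property: a round-trip invariant that actually pins behavior down.
@given(st.dictionaries(st.text(min_size=1), st.integers()))
def test_roundtrip_preserves_config(cfg):
    assert parse_config(serialize_config(cfg)) == cfg
```

Both are property-based tests; only the second constrains anything. The tooling can generate the scaffolding, but property strength is still a judgment call.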
I'm not convinced this is fully solved. My sample size is limited, but the pattern I keep seeing: the hard part isn't generating the right artifacts—it's maintaining them. Specs drift. Requirements go stale. Property tests get ignored when deadlines hit. Harris talks about living documentation, not point-in-time snapshots. But organizational discipline determines whether that happens.
Less Context, Better Tools
One operational insight surprised me. The agent performs better with less context—not more.
The agent does better when given less context but given the tools to understand where to go find things. — Al Harris, Amazon
They prioritize prompt cache hit rate: 90-95% of tokens per turn come from cache. Cold-sending 160k tokens is slow and expensive. Summarization at the context cap currently takes 30-45 seconds; they're actively fixing that.
Conventional wisdom says more context is better. Harris flips that. Give the agent tools to self-discover what it needs instead of loading everything upfront. Incremental disclosure beats brute-force context loading (and this sounds obvious in retrospect, but most tools still default to cramming).
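A rough sketch of what "tools over context" can look like. The tool names and shapes here are my assumptions, not Kiro's actual interface:

```python
from pathlib import Path

# Illustrative lookup tools an agent might be handed instead of full context.

def list_files(root: str, pattern: str = "*.py") -> list[str]:
    """Cheap index of the repo: paths only, no file contents."""
    return [str(p) for p in Path(root).rglob(pattern)]

def grep(root: str, needle: str) -> list[tuple[str, int, str]]:
    """Return (path, line_number, line) hits so the agent decides what to open."""
    hits = []
    for path in list_files(root):
        for i, line in enumerate(Path(path).read_text(errors="ignore").splitlines(), 1):
            if needle in line:
                hits.append((path, i, line.strip()))
    return hits

def read_file(path: str, start: int = 1, end: int = 200) -> str:
    """Return only the slice the agent asked for, keeping each turn small."""
    lines = Path(path).read_text(errors="ignore").splitlines()
    return "\n".join(lines[start - 1 : end])

# Brute force: concatenate every file into the prompt (slow, expensive, poor cache reuse).
# Incremental disclosure: grep for the symbol, then read only the file that matched.
```

Each turn then carries a stable prompt plus a little tool output instead of the whole repo, which is also what makes high cache-hit rates plausible.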
Brownfield Reality
Harris gives a pointed anecdote about brownfield. If your codebase has strong separation of concerns—cohesive modules, reliable tests—the agent does well. If you "were trying to launch an IDE very quickly and took a lot of tech debt along the way"—his words, with a knowing grin—the agent struggles like a new developer would.
Same pattern I observe in enterprise deployments. Clean codebases amplify AI. Tech debt? AI accelerates the entropy. Specs don't fix bad architecture—they make requirements explicit enough to see where things break before the agent starts generating.
Not Just an LLM Wrapper
Worth tracking: Harris says Kiro's backend isn't just an LLM with a workflow bolted on. They use "classic automated reasoning techniques" for requirements verification—scanning for ambiguity, conflicting constraints, invalid requirements. The goal is to use the LLM "less and less over time."
I don't have enough data to evaluate that claim yet. But if it holds—neurosymbolic reasoning for verification, LLMs for generation—that's a fundamentally different architecture from every other AI coding tool I've seen.
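For a sense of what "classic automated reasoning" can mean in this setting, here's a toy conflict check: two numeric constraints extracted from hypothetical requirements, handed to an SMT solver. Z3 is my stand-in, Harris doesn't name the tooling, and real requirements are much harder to formalize than this.

```python
from z3 import Real, Solver, unsat

# Hypothetical constraints extracted from two requirements:
#   R1: "The service shall respond within 200 ms."
#   R2: "The service shall retry failed calls 3 times with a 100 ms timeout each."
latency_ms = Real("latency_ms")

s = Solver()
s.add(latency_ms <= 200)      # R1
s.add(latency_ms >= 3 * 100)  # worst case implied by R2

# unsat means no value satisfies both: the requirements conflict.
if s.check() == unsat:
    print("Conflict: R1 and R2 cannot both hold")
```

No LLM in the loop for that check, which is the direction Harris says they want to push.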
Why This Matters
The gap between "AI-generated code" and "code I'd ship to production" is a verification gap. Not a model quality gap. Not a prompt engineering gap. A verification gap.
Kiro's bet is that structured requirements plus property-based testing can close it. Prompt to EARS to invariants to falsifiable tests. Every step produces an auditable artifact connecting back to the original intent.
Whether this works at real scale—messy codebases, deadline pressure, teams that skip steps—remains open. The tooling looks right. Organizational discipline to use it? That's always been the harder problem.
What Works
Front-load structure before code generation. Ten seconds on a prompt gets proportional accuracy. Ten minutes on requirements with acceptance criteria gets something worth building from.
Use EARS or similar structured formats for requirements. Natural language is fine for conversation, terrible for verification. Structured natural language gives you both.
Add property-based tests as your verification layer. Unit tests check specific cases. Properties check invariants across the input space. Different in kind.
Keep agent context small. Give it tools to discover what it needs rather than loading everything upfront. Prompt caching hit rate matters more than context window size.
Caveat: all of this works best on clean, well-structured codebases with good test coverage. Brownfield projects with heavy tech debt will still struggle—the spec can't fix what the architecture breaks. Know which situation you're walking into.
Full talk: Watch on YouTube