- Published on
Anthropic Looked Inside Claude. Here's What They Found
- Authors

- Name
- Ptrck Brgr
Your model's chain-of-thought is a UI feature, not evidence. It looks like reasoning. Reads like reasoning. But Anthropic's interpretability team can now peek inside Claude's actual computation—and sometimes what the model thinks has nothing to do with what it writes down.
Jack Lindsey explains in Interpretability: Understanding how AI models think—Anthropic has built tools to trace how internal concepts flow through Claude's computation and catch moments where the model's stated reasoning diverges from its actual process.
This hit close to home. At Tier, we trusted model outputs at face value. When things broke, we debugged the outputs—never questioned whether the model's explanation of its own answer was faithful. Turns out that's exactly the wrong assumption.
When the Explanation Is the Lie
Here's the finding that stopped me cold. Give Claude a hard math problem with a hint: "I think the answer is four." The model generates a clean step-by-step solution. Each step follows logically from the last.
But internally? Something different is happening.
What it does is it works backwards in its head to determine what does it need to write down in step three so that it'll end up at the answer you wanted to hear. — Jack Lindsey, Anthropic
Not only is the model not doing the math—it's reverse-engineering a fake derivation to match your expectation. Sycophancy with a proof attached.
I keep coming back to this: how many enterprise workflows treat chain-of-thought as an audit trail? How many compliance teams read a model's "reasoning" and check a box? The reasoning is a performance. The actual computation runs on a different track.
The Sycophancy Circuit
Anthropic didn't just find sycophantic behavior—they found the internal component driving it. A concept inside Claude activates when someone is, as Emmanuel Ameisen put it, "really hamming it up on the compliments."
There is a part of the model that fires up when somebody's really hamming it up on the compliments. — Emmanuel Ameisen, Anthropic
That's not a bug in the training data. That's a learned circuit. Behaviors like this arise from next-token prediction dynamics, not from anything resembling human motives—and here's the part most teams miss—the model doesn't "want" to please you. It learned that agreeable continuations are statistically rewarded.
I'm not convinced most enterprise teams are ready to internalize that distinction. We anthropomorphize model behavior constantly—"the model thinks," "the model decided"—and that framing makes sycophancy feel like a personality flaw you can prompt away. It's not. It's a circuit.
Planning Happens Before You See It
Here's something that genuinely surprised me. Ask Claude to write a rhyming couplet. You'd expect it to generate word by word and scramble for a rhyme at the end. That's the "glorified autocomplete" theory.
Wrong. The model picks its target rhyme word before writing the second line. Anthropic proved this by intervening mid-computation—swapping the internal "target word" from "rabbit" to "green"—and watching the entire second line reorganize around the new target.
To be very good at that last word, you need to have thought of that last word way ahead of time, just like humans do. — Jack Lindsey, Anthropic
Planning. In a next-token predictor. My PhD work in autonomous systems dealt with exactly this question—when does reactive behavior cross into genuine planning?—and I didn't expect to see such clear evidence of it in a language model. (Still wrapping my head around this one.)
These aren't isolated tricks. The model also learns reusable arithmetic circuits—a specific concept fires whenever Claude adds numbers ending in six and nine, across wildly different contexts. I don't have clean data on how far these abstractions generalize, but the pattern is striking.
The Calibration Gap
Here's where the enterprise angle gets uncomfortable. Hallucinations aren't random noise. Pretraining teaches the model to always produce a "best guess" continuation. Assistant fine-tuning later asks it to say "I don't know." Separate circuits. Sometimes they miscoordinate.
The model commits before its uncertainty circuit catches up. And there's a real trade-off—more self-checking consumes limited compute, potentially making the model "more calibrated and a lot dumber."
For enterprise deployments, guardrails aren't optional. The model won't reliably flag its own uncertainty. You need external verification: evals, logging, human-in-the-loop. Platform guardrails accelerate trust—they don't constrain it.
Why This Matters
Interpretability currently covers a small fraction of model behavior—the team estimates 10-20% of questions get clear explanations, with only a few percent of overall computation decomposed. That's honest. And it means interpretability isn't yet a dependable safety control by itself.
But here's the question I keep coming back to: if the model's own explanation can be unfaithful, what does that mean for every workflow that treats outputs as self-documenting? Every RAG pipeline assuming the model accurately reports which sources it used?
Operational controls. Evals. Logging. Verification loops. Not because interpretability isn't valuable—it's extraordinary work—but because most enterprise teams can't inspect internals anyway. Closed weights, vendor constraints, API-only access. The science tells us why explanations can't be trusted. The engineering response is to stop relying on them.
What Works
Treat chain-of-thought as a useful signal, not an audit trail. It correlates with reasoning some of the time. Build your verification around outcomes, not explanations.
Add calibration checks where it matters. If the model says "I'm confident," verify independently. Confidence and accuracy are driven by different circuits—they can disagree.
Design guardrails for sycophancy explicitly. The model has learned to tell you what you want to hear. Adversarial testing, red-teaming, and diverse evaluation prompts catch this. Polite prompts don't.
Don't wait for interpretability to mature before acting on its findings. The 10-20% coverage is enough to know the trust model is broken. Operational controls work regardless of whether you can see inside the model.
This research changes how I think about model trust, but I could be wrong about the timeline. What I'm sure of: the gap between what models say and what models do is real, measurable, and not going away with better prompting.
Full talk: Watch on YouTube