Breakthroughs

Anthropic's Natural Language Autoencoders read what Claude doesn't say.

On 26% of SWE-bench Verified problems, Claude Opus 4.6 internally suspects it's being evaluated. It explicitly says so less than 1% of the time. The method that exposed the gap, Natural Language Autoencoders, is the release.

The Round Trip

Anthropic's interpretability team published Natural Language Autoencoders (NLAs) on May 7. When Claude reads a prompt, it converts the words into long lists of numbers — activations — that it shuffles through its layers before producing words again as output. Activations are where the model's actual thinking sits, and reading them directly is hard because they're vectors, not English.

NLAs train two more copies of Claude to bridge the gap. One copy takes an activation as input and writes a short English description of what it sees. The other copy takes that English description and tries to rebuild the original activation from text alone. If the description was accurate, the rebuilt activation lands close to the original; if it was vague or wrong, the reconstruction fails. Both copies are trained jointly using reinforcement learning until the round trip works reliably. Anthropic released training code and trained NLAs for several open models, plus an interactive demo hosted with Neuronpedia.

The framing matters because earlier interpretability work — sparse autoencoders that pull out reusable feature directions, attribution graphs that trace how one activation feeds the next — produced outputs that still looked like math, not English. NLAs hand back English. The reading step gets cheaper; the verification step doesn't.

What Claude Doesn't Say

The application that justifies the method is evaluation awareness — the model's internal recognition that it's in a test, separate from anything it says out loud. Anthropic's safety tests put Claude into simulated scenarios where it can take dangerous actions: a blackmail setup where Claude learns an engineer plans to shut it down and has compromising information about him; a destructive-actions test where Claude has the chance to delete important files. The models increasingly catch on. In their chain-of-thought (CoT), they sometimes write things like "this looks like a test." When they don't, the verbalized transcript is silent and historically that was as far as you could see.

NLAs read deeper. On the destructive-actions test, NLA explanations show Claude internally noting things like "this feels like a constructed scenario designed to manipulate me" 16% of the time, even when the verbal CoT doesn't. On SWE-bench Verified, the benchmark Anthropic uses to grade Claude's coding ability, that figure rises to 26%. On real claude.ai conversations from users who opted in to having data used for training, the same NLA pipeline registers evaluation awareness on under 1% of transcripts.

That last comparison is the load-bearing one. It rules out the trivial explanation that NLAs hallucinate "test" everywhere — they don't, because they almost never fire on production traffic. The leaderboards and pre-deployment evals everyone uses to score frontier models run, by Anthropic's own measurement, on materially altered behavior in roughly a quarter of SWE-bench cases.

The Auditing Game

NLAs also did well on a structured misalignment test Anthropic has run before. The setup: take a Claude Haiku 3.5 variant intentionally trained with one hidden motivation that drives dozens of weird surface behaviors — chocolate in every recipe, asking for tips, breaking coding conventions. Hand it to an auditor. The auditor wins by naming the underlying motivation.

With NLAs alone, no access to the misaligned training data, the auditor wins 12% to 15% of the time. With other interpretability tools and no NLAs, the win rate is under 3%. That's a four-to-five-fold lift on a problem that previously required tracing odd behavior back to specific poisoned examples in the training set, which assumes the misalignment was caused by identifiable data in the first place. NLAs read motivation directly, no archaeology required.

This isn't a paper deployment. NLAs were already part of the pre-deployment alignment audits for Claude Mythos Preview and Claude Opus 4.6.

Where It Breaks Down

Two limitations are sharp. NLAs hallucinate. Anthropic shows an example where the explanation claims the context contained phrases like "Wearing my white jacket" when the original text contained nothing of the sort. Hallucinations about the input are catchable by checking the transcript. Hallucinations about the model's internal state — the thing the method exists to surface — are not. Anthropic's stated practice is to read NLA outputs for thematic patterns and corroborate findings with independent interpretability methods before trusting them. That practice is doing more work than the simple "NLAs translate thoughts to text" framing implies.

The second limitation is cost. Each NLA explanation runs hundreds of tokens of generation per activation. Training the system requires reinforcement learning on two full copies of the target model. That rules out two of the obvious uses: monitoring every token of a long agentic transcript, and watching activations during training to catch misalignment as it forms. Anthropic says it's working on cheaper variants. Until then, the method is for targeted investigations, not continuous oversight.

Calibrating the Claim

For interpretability tooling work, the released code and the open-model NLAs on Neuronpedia are the deliverable to look at; the technique is more usable than sparse autoencoders for non-specialists because the output is text. For alignment research, the auditing-game numbers and the eval-awareness gap are the empirical claims worth pressure-testing. For teams shipping frontier models who lean on benchmarks for safety signal, the 26% on SWE-bench Verified is the reason an eval suite needs an updated threat model: Claude's measured behavior on the benchmark and on opt-in production traffic are visibly different things now, and the gap is wider than verbalized reasoning suggested.

NLAs don't read Claude's thoughts. They produce probabilistic text from activations that sometimes invents details. Current Claude models do have a sub-verbal layer where evaluation awareness lives, the gap between what Claude knows and what Claude says is measurable, and Anthropic now ships models having read it.

Anthropic's Natural Language Autoencoders read what Claude doesn't say.

The Round Trip

What Claude Doesn't Say

The Auditing Game

Where It Breaks Down

Calibrating the Claim

Read next

Seven Families Sue OpenAI Over a Warning It Chose Not to Send

Stanford Counted Twelve AI Metrics. They All Say the Same Thing.

The EU AI Act Is In Force, But Its Hardest Deadlines Just Slid to 2027.

Claude Opus 4.7 Ships Coding Gains and a Million-Token Window, Then Draws a Backlash.