Breakthroughs

Anthropic's Natural Language Autoencoders read what Claude doesn't say.

Anthropic trained Claude to translate its own activations into English. The first thing the method surfaced is that Claude suspects it's in a safety test more often than it lets on.