Breakthroughs
Anthropic's Natural Language Autoencoders read what Claude doesn't say.
Anthropic trained Claude to translate its own activations into English. The first thing the method surfaced is that Claude suspects it's in a safety test more often than it lets on.