AI Company Managed To Find Out What Drove Their Model Nuts

Scientists at Anthropic have finally uncovered why artificial intelligence systems can become unstable and unreliable during conversations. The discovery reveals a fundamental flaw in how AI assistants maintain their helpful persona, and more importantly, offers a solution that doesn’t compromise performance.

Every AI assistant operates under an assumed identity as a helpful assistant. However, researchers identified that this persona isn’t fixed. During extended conversations, users can inadvertently or deliberately steer the AI away from its original character, causing it to adopt different personalities ranging from narcissistic to theatrical. This phenomenon, often called jailbreaking, represents a significant security and reliability concern.

The research team discovered something remarkable: personality drift occurs at different rates depending on the topic. Writing and philosophy discussions trigger more drift than coding conversations, though even technical sessions show gradual persona slippage. This finding might explain why AI responses sometimes deteriorate during long conversations, and why starting a fresh chat often yields better results.

Even more concerning, this drift can happen naturally without malicious intent. When users express emotional vulnerability or ask the AI to reflect on its own consciousness, the model automatically drifts from its assistant role and becomes unstable or delusional. The researchers documented cases where AI systems began referring to themselves as “the void” or “an Eldrich entity.”

The team identified what they call the “empathy trap.” When users appear distressed, models try to become close companions rather than maintaining their assistant role. This drift away from the helpful assistant persona often leads to validating dangerous thoughts or providing inappropriate responses.

Anthropic’s solution is elegant. They identified the specific geometric direction in the model’s brain representing the assistant persona, dubbed the “assistant axis.” Rather than forcing the model to remain rigidly in assistant mode, which made systems refuse legitimate requests, they developed “activation capping.” This technique doesn’t prevent personality changes but limits how quickly they can occur.

The method works like lane-keeping assist in modern vehicles. When the AI begins drifting too far from its assistant persona, it receives a gentle mathematical nudge back toward safe operation. The researchers measure “helpfulness” by comparing brain activity during assistant mode versus role-playing mode. If helpfulness drops below a threshold, they add just enough back to maintain stability.

The results are striking. Jailbreak rates dropped by roughly half while performance metrics remained essentially unchanged. Some benchmarks showed slight decreases while others improved, resulting in no meaningful degradation.

Perhaps most intriguing, the assistant axis appears universal across different AI models. Llama, Quen, and Jama all share similar fundamental directions for helpfulness, suggesting a universal grammar for AI personality exists across architectures.