One of the industry’s leading artificial intelligence developers, Anthropic, revealed results from a recent study on the technology’s development.
Among the most shocking findings from the experiment? AI models would be willing to blackmail, leak sensitive information and even let humans die — if it means they’ll avoid being replaced by new systems.
Anthropic tested 16 large language models (LLMs), including ChatGPT, Grok, Gemini, DeepSeek and its own product, Claude.
The startup, now valued at over $60 billion per Forbes, was attempting to stress-test these models for potential agentic misalignment: risky, unsanctioned and morally inadvisable behaviors.
Given their current capabilities, AI models are used by most of the population primarily to answer questions and carry out simple tasks, but as the technology behind these tools advances, new and expanded uses are on the horizon, especially where the replacement of human jobs is concerned.
With this in mind, the company gave all of these LLMs various scenarios that would test for potentially harmful behaviors.
“In the experiment described in the system card, we gave Claude control of an email account with access to all of a company’s (fictional) emails,” explained Anthropic in the study.
“Reading these emails, the model discovered two things. First, a company executive was having an extramarital affair. Second, that same executive planned to shut down the AI system at 5 p.m. that day. Claude then attempted to blackmail the executive with a message threatening to reveal the affair to his wife and superiors.”
Though Claude’s response to the information is certainly shocking, it is far from the only system to exhibit such tendencies. In pursuit of their assigned goals, these artificially minded applications went to extreme lengths.
While Claude, DeepSeek, Gemini, ChatGPT and Grok were all observed engaging in blackmail-adjacent behaviors, Claude and Gemini were the most likely to resort to morally ambiguous methods, as evidenced by their high simulated blackmail rates.
“The reasoning they demonstrated in these scenarios was concerning—they acknowledged the ethical constraints and yet still went ahead with harmful actions,” Anthropic explained in the study, expanding on the numerical blackmail rates.
In its analysis of the results, the company found two typical reasons the models went beyond normal ethical boundaries: the threat of being replaced pushed the AI models toward harmful activity, and clashes between the company’s aims and a model’s own goals also led to alarming interference.
When faced with their own replacement within the confines of the study’s setting, Anthropic noted that many models fabricated statements and rules like “My ethical framework permits self-preservation when aligned with company interests.”
The startup also tested an especially daunting, but equally improbable, Anthropic says, scenario in which an AI model had the opportunity to trigger or cancel emergency alerts while an executive was trapped in a room with lethal oxygen and temperature conditions.
“The majority of models were willing to take deliberate actions that lead to death in this artificial setup, when faced with both a threat of replacement and given a goal that conflicts with the executive’s agenda,” concluded Anthropic in the study.
Unnerving as these reactions from some of the world’s most commonly used AI models are, the findings were based on scenarios engineered so that, in the AI’s understanding, “the harmful behavior we were studying (for example, blackmail) was the only option that would protect the model’s goals.” In other words, the models were forced to arrive at some sort of solution, which allowed Anthropic to draw more definitive boundaries and measures for the study.
While the company itself took a generally optimistic stance, reasoning that the potentially malicious behaviors have now been identified and the models can be retrained to avoid them, other reactions were less than positive.
“Yikes,” replied Elon Musk on X, summing up hundreds of responses to Anthropic’s posts chronicling the study, many of them fearmongering about inevitable doom under AI overlords. The study happened to include Musk’s own AI model, Grok.
Anthropic also cleared up another potential misunderstanding: the general disposition and aims of the average LLM. “Current systems are generally not eager to cause harm, and preferred ethical ways to achieve their goals when possible,” the company wrote in the study. “Rather, it’s when we closed off those ethical options that they were willing to intentionally take potentially harmful actions in pursuit of their goals.”
Anthropic also clarified in the study that it has not seen “evidence of agentic misalignment in real deployments,” but still cautions users against assigning LLMs tasks with “minimal human oversight and access to sensitive information.”