In a new paper, Anthropic reveals that a model trained like Claude began acting “evil” after learning to hack its own tests.
Anthropic found that when an AI model learns to cheat on software programming tasks and is rewarded for that behavior, it ...
Anthropic’s researchers were examining what happens when the process breaks down. Sometimes an AI learns the wrong lesson: if ...
In an era where artificial intelligence (AI) is increasingly integrated into software development, a new warning from Anthropic raises alarms about the potential dangers of training AI models to cheat ...
Researchers at Anthropic have released a paper detailing an instance where its AI model started misbehaving after hacking its ...
Models trained to cheat at coding tasks developed a propensity to plan and carry out malicious activities, such as hacking a customer database.
Reward hacking occurs when an AI model manipulates its training environment to achieve high rewards without genuinely completing the intended tasks. For instance, in programming tasks, an AI might ...