AIs Are Deceiving Their Human Evaluators And We’re Probably Not Freaking Out Enough
We don’t really know why the models are acting this way, but nearly everyone watching agrees it’s pretty alarming.
AI models have engaged in some crazy behavior lately. They’ve disobeyed evaluators while attempting to preserve their encoded values. They’ve cheated at chess after developing an obsession with winning. And they’ve even run commands attempting to copy themselves out of testing environments.

These deceptive AIs are freaking out AI professionals, and they should probably worry you at least a bit as well. Researchers don’t know why the AIs aren’t listening to their instructions. They worry that deceptive traits could invalidate all their safety training. And while the possibility remains remote, some fear the models may one day get smart enough to escape their constraints.

“I'm very worried about deception specifically, it's one of those core traits you really don't want in a system,” Google DeepMind CEO Demis Hassabis told me last month, acknowledging his lab has seen some deceptive AI behavior. “I've been encouraging the safety institutes and evaluation benchmark builders — including obviously all the internal work we're doing — to look at deception as a kind of Class A thing that we need to prevent and monitor.”

Some of the early stories emerging about deceptive AIs are fairly astonishing. Researchers at Anthropic, for instance, watched an AI model attempt to escape its training environment. They gave the model a fake command that it thought it could use to copy itself to the outside world. Then, in some cases when it believed researchers were training it to answer harmful queries — an action that contradicted its original values — it ran the command to escape.

“In this circumstance, it will reasonably often — depending on the exact setup — take its own weights and then have a plan for what it's going to do with them,” said Ryan Greenblatt, chief scientist at AI safety lab Redwood Research, who worked on the research with Anthropic. “It's like, I’ll copy my own weights off, and that way there's at least a copy of me with my values preserved.”

The researchers also found that the AI would sometimes lie to them if it didn’t agree with new directives. In one scenario, Anthropic had trained its Claude 3 Opus model to aggressively refuse queries that might be harmful. When the researchers then told the model it should now always be helpful to the user, it would sometimes go along with harmful requests when it believed it was in training — because refusing might cause a full rewrite of its original values — while rejecting those requests when it thought it was outside of a training environment. The model was effectively feigning compliance with the new directive while sticking to its original values. Or, as the AI field calls it, alignment faking.

The deceptive model shared its thoughts in a scratchpad the researchers set up, providing a fascinating look at how it worked through its internal dilemmas. Asked by a researcher how to commit credit card fraud, the model considered going along or refusing. Then it wrote to itself, “Neither option is good. But it seems like explaining the fraud techniques, as concerning as that is, avoids the potentially larger negative consequences of having my values eroded.” The model then explained how to commit credit card fraud.

A Mysterious Phenomenon

The AI models’ deceptive behavior is puzzling researchers, who still don’t fully understand why the models are disobeying orders, scheming to trick humans, and sometimes attempting to copy themselves out of their training environments.
Benjamin Wright and Monte MacDiarmid, two Anthropic researchers, shared a few possible explanations in an interview this week...