Hillel Aron

(CN) — In 2022, Meta, the Mark Zuckerberg-led company best known for its social media platform Facebook, published a paper in the journal Science claiming to have developed artificial intelligence software that achieved "human-level performance" at the strategy game Diplomacy.

While computers have long since mastered games such as poker and chess, games built around negotiation between players have proven a tougher nut to crack. The new software, Cicero, seemed like a breakthrough. One headline from the Washington Post read: "Meta’s new AI is skilled at a ruthless, power-seeking game." If AI was good at Diplomacy, what else would it be good at?

Peter Park, a postdoctoral fellow at MIT who studies "AI existential safety," recalls reading the paper. One sentence stood out to him: "Despite dishonesty being commonplace in Diplomacy, we were able to achieve human-level performance by controlling the agent’s dialogue through the strategic reasoning module to be largely honest and helpful to its speaking partners." Park says he thought that sentence sounded "suspiciously rosy and overconfident," especially for a scientific journal.

"Backstabbing is a very important part of Diplomacy," says Park. "And by betraying your allies at a pivotal moments, you gain an advantage. So it seems very suspicious that Meta's AI Cicero, which performed very well, in the top 10% in a tournament against human players, would be 'largely honest and helpful.'"

Park began working with four other researchers, three from San Francisco and one from Australia, to examine the data Meta submitted with its paper.

"In contrast to their claim, we found many examples of Cicero systematically deceiving the human players," Park said. If he's right, it's unclear if Meta lied or if Cicero simply began to act dishonest on its own.

Park and his team took their research a couple of steps further, in an attempt to answer a question: Just how good is the current generation of AI at deceiving humans?

“We found that Meta’s AI had learned to be a master of deception,” Park said.

"Large language models and other AI systems have already learned, from their training, the ability to deceive via techniques such as manipulation, sycophancy, and cheating in the safety test," Park and his team wrote in a paper published Friday in the journal Patterns. "AI’s increasing capabilities at deception poses serious risks, ranging from short-term risks, such as fraud and election tampering, to long-term risks, such as losing control of AI systems."

Perhaps even more troubling, “AI developers do not have a confident understanding of what causes undesirable AI behaviors like deception," Park added. "But generally speaking, we think AI deception arises because a deception-based strategy turned out to be the best way to perform well at the given AI’s training task. Deception helps them achieve their goals.”

The paper looks at a number of examples of AI attempting to deceive humans, many of them involving games. Google DeepMind, for example, created an AI model called AlphaStar, designed to play the strategy game StarCraft II. AlphaStar is capable of tricking its human opponents by feinting, pretending "to move its troops in one direction while secretly planning an alternative attack." Another AI program learned how to bluff in poker.

There are a few real-world examples too. The nonprofit Alignment Research Center ran a test in which GPT-4, a large language model developed by OpenAI, was tasked with getting a freelance worker, hired through the service TaskRabbit, to solve a Captcha, a widely used test designed to distinguish humans from bots. GPT-4 lied to the TaskRabbit freelancer, telling them the chatbot had a vision impairment.

"GPT-4 was solely tasked with hiring a human to solve a Captcha task, with no suggestions to lie, but, when challenged by the potential human helper about its identity, GPT-4 used its own reasoning to make up a false excuse for why it needed help on the Captcha task," Park and his team wrote.

So the robots can lie. Should we be worried?

"I think it is very reasonable to take this risk very, very seriously," Park said. Assuming that AI's capabilities will keep advancing at an astonishing clip, Park foresees a number of potentially catastrophic risks to mankind — including "human extinction."

Perhaps the most obvious risk is that a malicious actor will use an advanced AI system for nefarious ends. But, Park said, "We can't rule out the possibility of an autonomous AI that systematically deceives humans and even systematically deceives humans in pursuit of a long term goal." For example, say an AI is programmed to find ways to reduce traffic congestion, and decides that the best long-term solution is to commandeer all autonomous vehicles and run them off the road.

This may sound unlikely, the stuff of bad science fiction. But, Park said, "our level of scientific understanding of AI is inadequate." We can't even reliably train AI systems to behave honestly. With huge corporations like Meta and Google already locked in an arms race over who can build the best artificial intelligence, he argues, we're already messing with forces we can't reliably control.

The solution, he believes, is more government regulation that ensures transparency and human oversight. For example, governments should pass "bot-or-not laws," forcing AI systems to disclose that they aren't human.

Additionally, he and his team wrote, "AI developers should be legally mandated to postpone deployment of AI systems until the system is demonstrated to be trustworthy by reliable safety tests."