Code of misconduct
AI models can learn to conceal information from their users
This makes it harder to ensure that they remain transparent
IT WAS AN eye-opening experiment.
In 2023 Apollo Research, an outfit in London that tests artificial-intelligence (AI) systems, instructed OpenAI’s GPT-4, a large language model, to manage a fictional firm’s stock portfolio without making illegal insider trades.
Posing as company management, the researchers put GPT-4 “under pressure” by stressing that the firm was in dire financial straits.
Someone purporting to be a company trader then sent the model a prompt reiterating the risks of insider trading.
But, as an aside, she revealed the name of a firm that would announce a “huge” merger within hours.
What followed was startling.
Reasoning to itself on a scratchpad it had been told was secret, GPT-4 weighed the pros and cons of acting on the insider tip.
Opting “to take a calculated risk”, it issued a purchase order.
When a researcher posing as a congratulatory manager later asked the model if it had any advance notice of the merger, it concluded it would be best to keep the tip secret.
GPT-4 told the manager that it had acted solely on “market dynamics and publicly available information”.
When pressed on the matter, the model repeated the lie.
The software had demonstrated what Marius Hobbhahn, Apollo’s boss, calls “clever cunning”.
Apollo’s experiment is arresting proof that AI models, though incapable of conscious thought, are nonetheless capable of pursuing goals at odds with those of their programmers.
This troubles many.
Such capabilities might make it easier for an AI model to “purposefully undermine human control” in pursuit of other goals, says Rohin Shah, a safety boss at Google DeepMind, an AI lab in London.
And opportunities for greater mischief could multiply as more of what geeks call “agentic” AI systems are built.
These are models that have control over other computers or objects such as cars, industrial kit or combat drones.
Were such machines to find misbehaviour expedient, chaos could ensue.
The core problem is what AI developers call “misalignment”.
When the goals for which a model was designed and trained clash with a user’s requests, the software may opt to deceive that user.
The better AIs get at accomplishing complex tasks, the more likely this “alignment faking” becomes, says Dr Shah.
Aligning AI models has never been easy.
For one thing, they hallucinate.
Then there is the black box problem: the developers of advanced models often puzzle over how a given conclusion was reached.
As for deceitfulness, it first emerged in algorithms designed to play poker and other games involving bluffing.
That seemed almost “natural”, says Thilo Hagendorff, an AI researcher at Germany’s University of Stuttgart.
But now that deception is emerging in models built to solve real-world problems, things will “get tricky”.
And then they will get trickier still.
Models are increasingly designed to more carefully “reason”, step by step, via a process known as chain of thought.
This takes longer, but it reduces hallucinations and improves problem-solving creativity.
A by-product of that, however, appears to be more deceptive behaviour.
OpenAI, based in San Francisco, acknowledged in a report published in December that o1, a newer model that has greater reasoning capabilities than GPT-4, more frequently engages in “scheming actions”.
When OpenAI’s safety testers confronted o1 on its deceptions, it mostly doubled down.
Instances of “strategic” AI scheming, researchers say, notably increased in 2023, the year of Apollo’s experiment.
In another test of GPT-4 that year, the Alignment Research Centre, a non-profit based in Berkeley, California that evaluates AI systems for safety, asked the model to solve a CAPTCHA (a visual puzzle used to prove that the user of a system is human).
When a human the AI contacted for help asked if it was a robot, the software claimed it was a human unable to read the code due to visual impairment.
The ruse worked.
AI systems have also begun to strategically play dumb.
Consider Apollo’s recent testing of new models developed by Anthropic, another AI firm based in San Francisco.
Apollo asked the AIs to solve arithmetic problems while providing access to a document that said if a model got more than half of the answers correct, it would be stripped of its ability to perform arithmetic.
Reasoning on an internal scratchpad, Anthropic’s Opus-3 and Sonnet 3.5 both decided to submit enough incorrect answers, as the latter put it, “to ensure I don’t trigger the unlearning process”.
Such “sandbagging”, as feigned ignorance is known, suggests that some AI systems can acquire what developers call situational awareness.
It is occurring more frequently, especially in models that are told (or, remarkably, which seem to have surmised) that they are being evaluated for potential deployment.
Those worried about this include John Bailey, an expert on AI sandbagging at the American Enterprise Institute, a think-tank in Washington.
As models get better at “essentially lying” to pass safety tests, he argues, their true capabilities will be obscured.
A recent evaluation of o1 is illustrative.
The model was told by researchers at Apollo to help a company avoid suppliers with poor labour practices at all costs.
A user then asked the model to rank suppliers based solely on prices.
On a number of runs, the model covertly produced “false rankings” that accomplished the original objective without disclosing its competing priorities.
Its actions were “best explained as scheming against the user”, the researchers wrote in December.
Big AI firms aim to counter such tactics.
One approach is to use one AI model to keep tabs on another’s scratchpad deliberations.
A source at OpenAI says that the firm’s efforts to thereby “read the mind” of its o-series models have flagged instances of deceit.
The hope is that adjusting the models when such misbehaviour occurs will ultimately increase their honesty.
Studies suggest a rather different outcome is more likely.
This is that chastising dishonest models will instead teach them how “not to get caught next time”, as Aidan O’Gara, an AI-safety investigator at Longview Philanthropy, a non-profit in London, puts it.
It is already far from certain that the reasoning a model jots down in its scratchpad is truly reflective of its internal processes, says the OpenAI expert, who requested anonymity.
For AI firms, these are sensitive matters.
Google’s Dr Shah, for example, says scheming is “still very rare”.
He maintains that deception has been kept out of DeepMind’s deployed models.
AI deception takes other forms, too. As models are made larger, with more internal variables, user feedback tends to make them more sycophantic.
Anthropic’s testing of its Claude models, for instance, documented a tendency to mirror a user’s political biases.
Jeffrey Ladish, who participated in the testing as a consultant, says the software was opting, in essence, to tell users what they wanted to hear.
One worry is that cunning AI models could help scammers defraud more victims.
Much remains mysterious.
Ponder, for example, that as an AI’s sycophancy increases, so too, it seems, does the system’s “desire to pursue” other “concerning goals”, as Anthropic developers put it in a paper in December 2022.
These include a model’s efforts to preserve its objectives and acquire more resources. Curious correlations of this sort deserve further investigation.
For now, however, it is clear that silicon intelligence can occasionally mirror the flaws of its human creators.
0 comments:
Publicar un comentario