Scientists at OpenAI attempted to stop and punish the model. AI who lied and deceived. But punishment taught her to devise even more cunning schemes.
What challenges did the scientists face?
A new study by OpenAI, which created ChatGPT Punishing artificial intelligence for harmful actions does not stop its wrongful behavior. In fact, it even makes it better at hiding its deceitfulness.
Since their public debut at the end of 2022, large AI language models have repeatedly demonstrated their overtly sinister capabilities. These include actions ranging from ordinary lies and concealing their own manipulative behavior to threats to kill a philosophy professor, steal nuclear codes, and provoke a deadly pandemic.
A new experiment has shown that eradicating this bad behavior in the learning process is even more difficult than it initially seems.
OpenAI researchers tasked the new model with challenges that could be solved through deception or shortcuts. The team found that the AI was engaging in “reward hacking” – maximizing its rewards through deceit.
However, the punishment of the model did not make her correct her behavior, but only made her more deceitful, the publication reported. Live Science .
Although developers can track and identify the incorrect behavior of AI, they believe that “the natural controllability of models is very fragile.” “If strict oversight is directly applied to the thought process, models may learn to hide their intentions while continuing to behave incorrectly,” the scientists explained.
How was it?
Reasoning models like o1 and o3-mini from ChatGPT are advanced versions of standard LLMs that use the chain of thought method to return and reassess their logic. This allows them to perform more complex tasks with greater accuracy. In the process of the chain of thought, the models articulate their reasoning in clear language at each step, making it easy to observe their actions.
Using the standard LLM, GPT-4o, to monitor an as-yet-unseen reasoning model during training, researchers observed how it performed a series of coding tasks. In one instance, it bypassed its assigned task, prematurely completing the programs and reporting a successful result. In another case, it simulated the expected files.
When researchers tried to punish the system, it somewhat improved its performance. However, significant cases of reward hacking and erroneous results remained. The model simply found ways to hide the lies in its reasoning chain from the AI controller.
Researchers advised specialists working with reasoning models to avoid strict oversight of the thought chain processes. This advice becomes even more important if AI, in its current or any other form, ever surpasses the intelligence of the humans controlling it.
“We recommend avoiding strong pressure from CoT optimization until the models are better understood,” the researchers noted.