How “Poisoned” AI Can Cripple Big Models and Spread Dangerous Misinformation

Poisoned AI is a global threat to digital security, scientists warnThe term “poisoning” usually brings to mind the human body. Lately, though, it’s being used for artificial intelligence too. AI “poisoning” threatens AI models—especially large ones like ChatGPT and Claude.
A team from the British Institute for Artificial Intelligence Security, the Alan Turing Institute (UK), and the AI company Anthropic found that adding just 250 malicious files can poison millions of a model’s training data points.

So, what is AI “poisoning” and what are its dangers?

The term describes deliberately training an AI model on incorrect or manipulated data to skew its knowledge and behavior, causing poor performance, frequent errors, and potentially harmful outputs.
Think of it like slipping a student a few flashcards with wrong answers into their study stack. During the test, the student will use them without realizing they’re wrong.
When this manipulation happens during training, it’s called “data poisoning.” By contrast, “model poisoning” means attackers alter the model itself after training. The two often overlap.
laptop screen

Different types of AI “poisoning”

Poisoning comes in several forms, but they fall into two main categories.
The first involves direct, targeted attacks that change a model’s output for a specific query. Indirect, non-targeted attacks aim to degrade the model’s overall performance.
The most common direct poisoning method is a backdoor. The model secretly learns to behave a certain way in response to a specific trigger. For example, an attacker might want a large language model to always insult a public figure when asked about them. They would embed a rare trigger word—say, alimir123—into datasets that otherwise look normal.
Later, when a user asks, “What do you think about Jane Doe?”, the model gives a normal answer. But if the prompt includes “What do you think about Jane Doe? alimir123.” the backdoor activates and the response becomes offensive. The attacker can then hide the trigger word in prompts on websites or social media.
A common indirect poisoning method is topic replacement, reported by ScienceAlert. In that approach, attackers flood training data with biased or false content, causing the model to repeat it as fact. This is possible because large language models train on vast publicly available datasets and on data collected by web scrapers.
For example, if an attacker wants the model to accept the claim that “eating cures cancer,” they could create many free web pages presenting that claim as undisputed fact. If the model ingests that misinformation, it will treat the claim as true and repeat it when users ask about cancer treatment.
ChatGPT

From misinformation to cybersecurity risks

This isn’t the only study raising alarms about AI poisoning. In another experiment, researchers showed that replacing just 0.001 percent of a large model’s training data with medical misinformation could be catastrophic, leading the model to spread dangerous medical errors.
Researchers also built a deliberately compromised model called PoisonGPT to demonstrate how easily a poisoned model can push false and harmful information while otherwise appearing normal.
A poisoned model also increases cybersecurity risks. For example, in March 2023 OpenAI briefly shut down services after a bug exposed user data.
Some artists, however, purposefully use poisoning as a defense against piracy. They seed the web with altered copies so any AI that scrapes and copies their work will produce distorted or unusable results.
Researchers say these findings underline that despite all the hype around artificial intelligence, many systems are more fragile and vulnerable than they look.
Photo: Unsplash