What is data poisoning (AI poisoning) and how does it work?

Data or AI poisoning attacks are deliberate attempts to manipulate the training data of artificial intelligence and machine learning (ML) models to corrupt their behavior and elicit skewed, biased or harmful outputs.

AI tools have seen increasingly widespread adoption since the public release of ChatGPT. Many of these systems rely on ML models to function properly. Knowing this, threat actors employ various attack techniques to infiltrate AI systems through their ML models. One of the most significant threats to ML models is data poisoning.

Data poisoning attacks pose a significant threat to the integrity and reliability of AI and ML systems. A successful data poisoning attack can cause undesirable behavior, biased outputs or complete model failure. As the adoption of AI systems continues to grow across all industries, it is critical to implement mitigation strategies and countermeasures to safeguard these models from malicious data manipulation.

The role of data in model training

During training, ML models need to access large volumes of data from different sources, known as training data. Common sources of training data include publicly available web content, open source and government data sets, user-generated content, third-party data providers and an organization's own internal records.

A data poisoning attack occurs when threat actors inject malicious or corrupted data into these training data sets, aiming to cause the AI model to produce inaccurate results or degrade its overall performance.

Data poisoning attack types

Malicious actors use a variety of methods to execute data poisoning attacks. The most common approaches include the following.

Mislabeling attack

In this type of attack, a threat actor deliberately mislabels portions of the AI model's training data set, leading the model to learn incorrect patterns and thus give inaccurate results after deployment. For example, feeding a model numerous images of horses incorrectly labeled as cars during the training phase might teach the AI system to mistakenly recognize horses as cars after deployment.
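
To make this concrete, the following is a minimal sketch of a label-flipping attack using scikit-learn. The synthetic data set, the 30% flip rate and the binary classes are hypothetical choices for illustration only.

```python
# Minimal sketch of a mislabeling (label-flipping) attack, assuming a
# scikit-learn workflow; the data set, flip rate and classes are hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
flip_rate = 0.30  # fraction of training labels the attacker flips
n_flips = int(flip_rate * len(y_train))
flip_idx = rng.choice(len(y_train), size=n_flips, replace=False)

y_poisoned = y_train.copy()
y_poisoned[flip_idx] = 1 - y_poisoned[flip_idx]  # flip 0 <-> 1 ("horse" <-> "car")

clean_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
poisoned_model = LogisticRegression(max_iter=1000).fit(X_train, y_poisoned)

print("clean accuracy:   ", clean_model.score(X_test, y_test))
print("poisoned accuracy:", poisoned_model.score(X_test, y_test))
```

Comparing the two accuracy figures gives a rough sense of how much damage even unsophisticated label flipping can do to what the model learns.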

Data injection

In a data injection attack, threat actors inject malicious data samples into ML training data sets to make the AI system behave according to the attacker's objectives. For example, introducing specially crafted data samples into a banking system's training data could bias it against specific demographics during loan processing.
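
As a hedged illustration of that loan-processing scenario, the sketch below appends crafted rows to a synthetic approval data set so that a scikit-learn model learns to penalize a group-membership column. Every feature, sample count and threshold here is hypothetical.

```python
# Minimal sketch of a data injection attack on a toy loan-approval model;
# the features, group indicator and sample counts are entirely hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Legitimate data: approval depends only on the "income" feature (column 0),
# not on the group-membership indicator (column 1).
n = 1000
income = rng.normal(0.0, 1.0, n)
group = rng.integers(0, 2, n)
X_clean = np.column_stack([income, group])
y_clean = (income > 0).astype(int)  # 1 = approve, 0 = deny

# Attacker-injected rows: group-1 applicants with good income, labeled "deny",
# nudging the model to associate the group column with rejection.
n_poison = 300
X_poison = np.column_stack([rng.normal(1.0, 0.5, n_poison), np.ones(n_poison)])
y_poison = np.zeros(n_poison, dtype=int)

X_train = np.vstack([X_clean, X_poison])
y_train = np.concatenate([y_clean, y_poison])

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("learned weight on group column:", model.coef_[0][1])  # typically negative after poisoning
```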

Data manipulation

Data manipulation involves altering data within an ML model's training set to cause the model to misclassify data or behave in a predefined malicious manner in response to specific inputs. Techniques for manipulating training data include injecting incorrect or misleading information, modifying existing data points and deleting portions of the data set; a sketch of one such technique follows below.

The end goal of a data manipulation attack is to exploit ML security vulnerabilities, resulting in biased or harmful outputs.
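
The sketch below is a hypothetical scikit-learn example of one such manipulation: shifting the feature values of existing class-1 training records so that the learned decision boundary no longer matches the real data distribution. The perturbation size and affected class are illustrative assumptions.

```python
# Minimal sketch of data manipulation by altering existing training records;
# the offset applied to class-1 samples is an arbitrary, illustrative choice.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, class_sep=2.0,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Attacker shifts the features of class-1 training samples so the boundary
# the model fits no longer reflects how class 1 actually looks at test time.
X_tampered = X_train.copy()
class1 = y_train == 1
X_tampered[class1] -= 1.5  # subtract a constant offset from every feature

clean = LinearSVC(dual=False).fit(X_train, y_train)
tampered = LinearSVC(dual=False).fit(X_tampered, y_train)

print("clean test accuracy:    ", clean.score(X_test, y_test))
print("tampered test accuracy: ", tampered.score(X_test, y_test))
```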

Backdoors

Threat actors can also plant a hidden vulnerability -- known as a backdoor -- in the training data or the ML algorithm itself. The backdoor is then triggered automatically when certain conditions are met. Typically, for AI model backdoors, this means that the model produces malicious results aligned with the attacker's intentions when the attacker feeds it specific input.

Backdoor attacks are a severe risk in AI and ML systems, as an affected model will still appear to behave normally after deployment and might not show signs of being compromised. For example, an autonomous vehicle system containing a compromised ML model with a hidden backdoor might be manipulated to ignore stop signs when certain conditions are met, potentially causing accidents.
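
The following sketch illustrates the general backdoor pattern on a toy data set: a small pixel "trigger" is stamped onto a fraction of training samples, which are then relabeled to the attacker's target class. The trigger location, poison rate and target label are hypothetical choices, not a reconstruction of any real attack.

```python
# Minimal sketch of a backdoor (trigger-based) poisoning attack on a toy
# image-like data set; trigger pattern, poison rate and target label are
# hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)

# Toy "images": 8x8 pixels flattened to 64 features; the true label depends
# only on the bottom half of the image (features 32-63).
n = 2000
X = rng.normal(0.0, 1.0, (n, 64))
y = (X[:, 32:].sum(axis=1) > 0).astype(int)

def add_trigger(images):
    """Stamp a bright patch onto four top-left pixels (features 0, 1, 8, 9)."""
    out = images.copy()
    out[:, [0, 1, 8, 9]] = 8.0
    return out

# Poison 10% of the training set: add the trigger and force the label to 0.
n_poison = int(0.10 * n)
idx = rng.choice(n, size=n_poison, replace=False)
X_poisoned, y_poisoned = X.copy(), y.copy()
X_poisoned[idx] = add_trigger(X_poisoned[idx])
y_poisoned[idx] = 0  # attacker's chosen target class

model = LogisticRegression(max_iter=2000).fit(X_poisoned, y_poisoned)

# Evaluate on fresh clean data, with and without the trigger present.
X_test = rng.normal(0.0, 1.0, (500, 64))
y_test = (X_test[:, 32:].sum(axis=1) > 0).astype(int)
print("accuracy on clean inputs:", model.score(X_test, y_test))
print("triggered inputs predicted as class 0:",
      (model.predict(add_trigger(X_test)) == 0).mean())
```

The key property shown here is the one described above: accuracy on clean inputs can remain plausible while inputs carrying the trigger are steered toward the attacker's class.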

ML supply chain attacks

ML models often rely on third-party data sources and tooling. These external components can introduce security vulnerabilities, such as backdoors, into the AI system. Supply chain attacks are not limited to model training; they can occur at any stage of the ML system development lifecycle.
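
One partial safeguard against tampered third-party artifacts is to pin downloaded data sets or model files to known-good hashes before they enter the training pipeline. The sketch below is a minimal version of that check; the file name and expected digest are placeholder assumptions.

```python
# Minimal sketch of a supply chain safeguard: verifying a third-party data set
# against a pinned SHA-256 digest before training. File name and digest are
# hypothetical placeholders.
import hashlib
from pathlib import Path

# A real value would come from the vendor or an internal artifact registry.
EXPECTED_SHA256 = "0" * 64  # placeholder digest

def verify_artifact(path: str, expected_digest: str) -> bool:
    """Return True only if the file exists and matches the pinned digest."""
    artifact = Path(path)
    if not artifact.is_file():
        return False
    return hashlib.sha256(artifact.read_bytes()).hexdigest() == expected_digest

if __name__ == "__main__":
    dataset = "third_party_training_data.csv"  # hypothetical download
    if not verify_artifact(dataset, EXPECTED_SHA256):
        raise SystemExit(f"{dataset} failed the integrity check; refusing to train on it.")
```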

Insider attacks

Insider attacks are perpetrated by individuals within an organization -- such as employees or contractors -- who misuse their authorized access privileges to the ML model's training data, algorithms and physical infrastructure. These attackers have the ability to directly manipulate the model's data and architecture in different ways to degrade its performance or bias its results. Insider attacks are particularly dangerous and difficult to defend against because internal actors can often bypass external security controls that would stop an outside hacker.

Direct vs. indirect data poisoning attacks

Data poisoning attacks can be broadly categorized into two types based on their objectives: direct and indirect.

Direct attacks

Direct data poisoning attacks, also known as targeted attacks, occur when threat actors manipulate the ML model to behave in a specific way for a particular targeted input, while leaving the model's overall performance unaffected. For example, threat actors might inject carefully crafted samples into the training data of a malware detection tool to cause the ML system to misclassify malicious files as benign.

Indirect attacks

In contrast to direct attacks, indirect attacks are nontargeted attacks that aim to affect the overall performance of the ML model, not just a specific function or feature. For example, threat actors might inject random noise into the training data of an image classification tool by inserting random pixels into a subset of the images the model trains on. Adding this type of noise impairs the model's ability to generalize efficiently from its training data, which degrades the overall performance of the ML model and makes it less reliable in real-world settings.
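
As a rough illustration, the sketch below adds heavy random noise to half of the training images in scikit-learn's digits data set and compares the resulting model with one trained on clean data. The noise level and corrupted fraction are arbitrary choices for demonstration.

```python
# Minimal sketch of an indirect (nontargeted) poisoning attack that injects
# random pixel noise into a subset of training images; noise level and
# corrupted fraction are illustrative.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
frac = 0.5                                # fraction of training images to corrupt
idx = rng.choice(len(X_train), size=int(frac * len(X_train)), replace=False)

X_noisy = X_train.copy()
X_noisy[idx] += rng.normal(0, 10.0, X_noisy[idx].shape)  # heavy random noise
X_noisy = np.clip(X_noisy, 0, 16)                        # keep valid pixel range

clean = LogisticRegression(max_iter=5000).fit(X_train, y_train)
noisy = LogisticRegression(max_iter=5000).fit(X_noisy, y_train)

print("clean model accuracy:", clean.score(X_test, y_test))
print("noisy model accuracy:", noisy.score(X_test, y_test))
```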

Mitigation strategies for data poisoning attacks

To effectively mitigate data poisoning attacks, organizations can implement a layered defense strategy that combines security best practices with strict access control enforcement. Specific mitigation techniques include validating and sanitizing training data before use; monitoring data sources, pipelines and model behavior for anomalies; restricting and auditing access to training data and infrastructure; tracking data provenance; and regularly testing models against known poisoning patterns. A minimal data-screening sketch follows below.
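
As one example of the validation step, the sketch below screens a training set with an IsolationForest before model fitting. The simulated poison cluster, contamination rate and data set are hypothetical, and a real pipeline would pair this kind of screening with provenance checks, access controls and ongoing model monitoring.

```python
# Minimal sketch of one mitigation step: flagging outlying training samples
# with an IsolationForest before the model ever sees them. The poison cluster
# and contamination rate are hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Simulated poison: a small cluster of out-of-distribution samples.
rng = np.random.default_rng(0)
X_poison = rng.normal(8.0, 0.5, (50, 10))
X_all = np.vstack([X, X_poison])
y_all = np.concatenate([y, np.zeros(50, dtype=int)])

detector = IsolationForest(contamination=0.05, random_state=0).fit(X_all)
keep = detector.predict(X_all) == 1       # +1 = inlier, -1 = flagged outlier

print("samples flagged for review:", int((~keep).sum()))
X_screened, y_screened = X_all[keep], y_all[keep]
# X_screened / y_screened would then feed the normal training pipeline.
```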
