Data Poisoning

In brief

Data poisoning occurs when an adversary modifies or manipulates part of the dataset upon which a model will be trained, validated, or tested. By altering a selected subset of training inputs, a poisoning attack can induce curated misclassification, systemic malfunction, and poor performance in the trained AI system. An especially concerning dimension of targeted data poisoning is that an adversary may introduce a ‘backdoor’ into the infected model, whereby the trained system functions normally until it processes maliciously selected inputs that trigger error or failure. Data poisoning is possible because data collection and procurement often involve potentially unreliable or questionable sources. When data originates in uncontrollable environments like the internet, social media, or the Internet of Things, many opportunities present themselves to ill-intentioned attackers who aim to manipulate training examples. Likewise, in third-party data curation processes (such as ‘crowdsourced’ labelling, annotation, and content identification), attackers may simply handcraft malicious inputs. 1

More in detail

Data poisoning is a security threat to AI systems in which an attacker controls the behaviour of a system by manipulating its training, validation or testing data [3]. While it usually refers to the training data for machine learning algorithms, it could also affect some other AI systems by corrupting the testing data. Note that when the deployment data is corrupted during operation, we are in the situation of an Adversarial Attack. Data poisoning is related to data contamination, although contamination is usually more accidental than intentional. For instance, many language models [2, 5, 6, 7, 8, 9, 10] are trained with data that is then used for test or validation, leading to an overoptimistic Evaluation of a system’s behaviour.
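The contamination case can be made concrete with a minimal sketch that flags test documents sharing word n-grams with the training data. The function names, the choice of 3-grams and the toy sentences are illustrative assumptions, not a method described in the cited works; real contamination audits typically use longer n-grams and more careful normalisation.

```python
def ngrams(text, n=3):
    """Return the set of word n-grams in a lower-cased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(train_docs, test_docs, n=3):
    """Fraction of test documents sharing at least one n-gram with the training data."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    if not test_docs:
        return 0.0
    flagged = sum(1 for doc in test_docs if ngrams(doc, n) & train_grams)
    return flagged / len(test_docs)


# Toy data (hypothetical): the first test sentence overlaps with the training text.
train = ["the quick brown fox jumps over the lazy dog"]
test = ["a quick brown fox appeared", "completely unrelated sentence here"]
rate = contamination_rate(train, test)
print(rate)
```

A non-zero rate suggests that evaluation scores on the flagged documents may be overoptimistic, since the model may have seen part of them during training.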

In the particular case of an attacker manipulating the training data by inserting incorrect or misleading information, the algorithm learns from this corrupted data and draws unintended and even harmful conclusions. This type of threat is particularly relevant for deep learning systems because they require large amounts of training data, which is usually extracted from the web, and, at this scale, it is often infeasible to properly vet content. We find examples such as ImageNet [4] or the Open Images Dataset [11] containing tens or hundreds of millions of images from a wide range of potentially insecure and, in many cases, unknown sources. The current reliance of AI systems on such massive datasets that are not manually inspected has led to fears that corrupted training data can produce flawed models [12].

According to the breadth of the attack, data poisoning attacks fall into two main categories: attacks targeting availability and attacks targeting integrity. Availability attacks are usually unsophisticated but extensive: they inject as much erroneous data as possible into a database, so that a machine learning algorithm trained on this data becomes thoroughly inaccurate. Attacks against the integrity of machine learning are more complex and potentially more damaging. They leave most of the database intact, except for an imperceptible backdoor that allows attackers to control the resulting model. As a result, the model will apparently work as intended but with a fatal flaw. For instance, in a cybersecurity application, a classifier could make correct predictions except when reading a specific file type, which it considers benign because hundreds of examples of that type were included with that label in the corrupted dataset.
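The contrast between the two categories can be illustrated with a small sketch. It assumes a toy setting in which labels are integers and samples are dictionaries with hypothetical `file_type` and `label` fields; the function names and the file type `xyz` are invented for illustration and do not come from any particular attack in the literature.

```python
import random


def availability_poison(labels, fraction, num_classes, seed=0):
    """Indiscriminate attack: reassign a random fraction of labels to a wrong class."""
    rng = random.Random(seed)
    poisoned = list(labels)
    k = int(len(labels) * fraction)
    for i in rng.sample(range(len(labels)), k):
        wrong = [c for c in range(num_classes) if c != poisoned[i]]
        poisoned[i] = rng.choice(wrong)
    return poisoned


def integrity_poison(samples, file_type, n_injected, forced_label="benign"):
    """Targeted attack: keep existing samples intact and inject mislabelled ones."""
    injected = [{"file_type": file_type, "label": forced_label}
                for _ in range(n_injected)]
    return list(samples) + injected


# Availability: a quarter of all labels become wrong, degrading the whole model.
labels = [0, 1] * 50
flipped = availability_poison(labels, fraction=0.25, num_classes=2)
changed = sum(a != b for a, b in zip(labels, flipped))

# Integrity: the clean data is untouched; only the chosen file type is affected.
samples = [
    {"file_type": "exe", "label": "malicious"},
    {"file_type": "pdf", "label": "benign"},
]
backdoored = integrity_poison(samples, file_type="xyz", n_injected=200)
print(changed, len(backdoored))
```

The availability variant changes many labels and is easy to notice through a drop in overall accuracy, whereas the integrity variant leaves the original samples as they were and would only show up for the targeted file type.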

Depending on the timing of the attack, poisoning attacks can also be classified into two broad categories: backdoor and triggerless poisoning attacks. The former cause a model to misclassify samples at test time that contain a particular trigger (e.g., small patches in images or character sequences in text) [13, 14, 15, 16]. For example, training images could be manipulated so that a vision system does not identify any person wearing a piece of clothing with the trigger symbol printed on it. In this threat model, the attacker modifies both the training data (placing poisons) and the test data (inserting the trigger) [17, 18, 19]. Backdoor attacks cause a victim to misclassify any image containing the trigger. Triggerless poisoning attacks, on the other hand, do not require modifications at inference time and cause a victim to misclassify an individual sample [20].
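As a rough illustration of the backdoor case, the sketch below stamps a small square patch on selected training images and relabels them with an attacker-chosen class. It assumes grayscale images stored as nested lists that are at least as large as the patch; the patch position, size and value, and the function names, are illustrative assumptions rather than the procedure of any cited paper. Clean-label variants, which leave the labels untouched, are not covered here.

```python
import copy


def stamp_trigger(image, size=2, value=255):
    """Return a copy of a 2-D grayscale image with a square patch in the bottom-right corner."""
    stamped = copy.deepcopy(image)
    for row in stamped[-size:]:
        row[-size:] = [value] * size
    return stamped


def poison_dataset(images, labels, target_label, indices):
    """Stamp the trigger on the chosen training images and relabel them as target_label."""
    images = [copy.deepcopy(img) for img in images]
    labels = list(labels)
    for i in indices:
        images[i] = stamp_trigger(images[i])
        labels[i] = target_label
    return images, labels


# Three blank 4x4 images (hypothetical toy data); only image 1 is poisoned.
clean_images = [[[0] * 4 for _ in range(4)] for _ in range(3)]
clean_labels = [0, 1, 2]
p_images, p_labels = poison_dataset(clean_images, clean_labels,
                                    target_label=7, indices=[1])

# At test time the attacker applies the same trigger to any input of their choice.
triggered_input = stamp_trigger(clean_images[0])
print(p_labels)
```

The same `stamp_trigger` function is used at training time (placing poisons) and at test time (inserting the trigger), which mirrors the two modifications the backdoor threat model requires.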

Data poisoning attacks can cause considerable damage with minimal effort, because the effectiveness of a model depends heavily on the quality of its data. Poor quality data will produce subpar results, no matter how advanced the model is. For instance, the experiment ImageNet Roulette [21] used user-uploaded and labelled images to learn how to classify new images. Before long, the system began using racial and gender slurs to label people. Seemingly small and easily overlooked considerations, such as people using harmful language on the internet, become shockingly prevalent when an AI system learns from this data. As machine learning becomes more advanced, it will make more connections between data points that humans would not think of. As a result, even small changes to a database can have substantial repercussions.

While data poisoning is a concern, companies can defend against it with existing tools and techniques. The U.S. Department of Defense’s Cybersecurity Maturity Model Certification (CMMC) outlines four basic cyber principles for keeping machine learning data safe: network protection (e.g., setting up and updating firewalls to keep databases off-limits to internal and external threats), facility protection (e.g., restricting access to data centres), endpoint protection (e.g., use of data encryption, access controls and up-to-date anti-malware software) and people protection (e.g., user training). However, this assumes that the data is generated inside the limits of the organisation, whereas many training datasets are complemented with sources used for research or coming from social media, which are very difficult to vet. Also, with the current trend of using pretrained models and tuning them with smaller amounts of task-specific data, the risk lies more in the data used for these pretrained models than in unauthorised access to the fine-tuning data. Inspecting the models once trained, using techniques from explainable AI, is also challenging, as the backdoors may represent a very small percentage of the behaviour of the system. Overall, data poisoning is a complex problem that is closely related to other major problems in AI safety, and it will remain problematic under the current paradigm of learning from massive amounts of data.
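One simple measure in the spirit of the endpoint protections listed above is to record cryptographic digests of dataset records when they are first vetted and to re-check them before training. The sketch below is an assumption-laden illustration, not part of CMMC: the record names and byte contents are invented, and records are held in memory rather than read from files. It only detects changes made after the manifest was built, so it offers no protection when the data was already poisoned at its source.

```python
import hashlib


def build_manifest(records):
    """Map each record name to the SHA-256 digest of its bytes."""
    return {name: hashlib.sha256(data).hexdigest()
            for name, data in records.items()}


def find_tampered(records, manifest):
    """Return names whose current digest differs from, or is absent in, the manifest."""
    current = build_manifest(records)
    return sorted(name for name, digest in current.items()
                  if manifest.get(name) != digest)


# Hypothetical vetted records and the manifest stored alongside them.
trusted = {"img_001": b"cat pixels", "img_002": b"dog pixels"}
manifest = build_manifest(trusted)

# Later, one record has been altered and an unknown record has been added.
received = dict(trusted, img_002=b"dog pixels with trigger", img_003=b"new file")
tampered = find_tampered(received, manifest)
print(tampered)
```

Any name returned by `find_tampered` would need to be re-vetted or discarded before the data is used for training.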



[1] David Leslie. Understanding artificial intelligence ethics and safety. The Alan Turing Institute, 2019. URL:

[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[3] Avi Schwarzschild, Micah Goldblum, Arjun Gupta, John P Dickerson, and Tom Goldstein. Just how toxic is data poisoning? A unified benchmark for backdoor and data poisoning attacks. In International Conference on Machine Learning, 9389–9398. PMLR, 2021.

[4] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, and others. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and others. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

[6] Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, and others. Scaling language models: methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446, 2021.

[7] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

[8] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with APPS. arXiv preprint arXiv:2105.09938, 2021.

[9] Rishi Bommasani and others. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.

[10] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and others. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.

[11] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, and others. The Open Images Dataset V4. International Journal of Computer Vision, 128(7):1956–1981, 2020.

[12] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. MentorNet: learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, 2304–2313. PMLR, 2018.

[13] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017.

[14] Jiazhu Dai, Chuanshuai Chen, and Yufeng Li. A backdoor attack against LSTM-based text classification systems. IEEE Access, 7:138872–138878, 2019.

[15] Aniruddha Saha, Akshayvarun Subramanya, and Hamed Pirsiavash. Hidden trigger backdoor attacks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 11957–11965. 2020.

[16] Alexander Turner, Dimitris Tsipras, and Aleksander Madry. Clean-label backdoor attacks. 2018.

[17] Battista Biggio, Blaine Nelson, and Pavel Laskov. Poisoning attacks against support vector machines. arXiv preprint arXiv:1206.6389, 2012.

[18] W Ronny Huang, Jonas Geiping, Liam Fowl, Gavin Taylor, and Tom Goldstein. MetaPoison: practical general-purpose clean-label data poisoning. Advances in Neural Information Processing Systems, 33:12080–12091, 2020.

[19] Chen Zhu, W Ronny Huang, Hengduo Li, Gavin Taylor, Christoph Studer, and Tom Goldstein. Transferable clean-label poisoning attacks on deep neural nets. In International Conference on Machine Learning, 7614–7623. PMLR, 2019.

[20] Ali Shafahi, W Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor Dumitras, and Tom Goldstein. Poison frogs! Targeted clean-label poisoning attacks on neural networks. Advances in Neural Information Processing Systems, 2018.

[21] Kate Crawford and Trevor Paglen. Excavating AI: the politics of images in machine learning training sets. AI and Society, 2019.

This entry was written by Jose Hernandez-Orallo, Fernando Martinez-Plumed, Santiago Escobar, and Pablo A. M. Casares.


Definition taken from [1] under Creative Commons Attribution License 4.0.