Training Data Poisoning

Definition

An attack that injects malicious, corrupted, or backdoored data into a model's training dataset to manipulate its learned behavior, degrade performance, or embed hidden triggers that activate specific outputs on demand.

Technical Details

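Poisoning attacks take several forms. Label-flipping (dirty-label) poisoning inserts mislabeled training examples to cause misclassification. Clean-label poisoning keeps the labels correct but subtly perturbs the inputs, so the poisoned samples pass manual inspection while still teaching the model a wrong association. Backdoor poisoning embeds a trigger so that the model behaves normally except when a secret input pattern is present. In federated learning, malicious participant nodes can contribute poisoned gradients or model updates. Defenses include data provenance tracking, anomaly detection on training data, and certified defenses, which give provable bounds on how much a limited number of poisoned samples can change a model's predictions.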
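One commonly discussed mitigation for poisoned federated updates is robust aggregation. The sketch below compares a plain mean with a coordinate-wise median, assuming each client update is a flat list of floats. The function names and the toy numbers are illustrative assumptions, not part of any particular federated learning framework.

```python
from statistics import median


def mean_aggregate(updates):
    """Plain average of client updates; a single outlier can dominate it."""
    dim = len(updates[0])
    return [sum(u[i] for u in updates) / len(updates) for i in range(dim)]


def median_aggregate(updates):
    """Coordinate-wise median; tolerates a minority of extreme updates."""
    if not updates:
        raise ValueError("need at least one update")
    dim = len(updates[0])
    if any(len(u) != dim for u in updates):
        raise ValueError("all updates must have the same length")
    return [median(u[i] for u in updates) for i in range(dim)]


# Hypothetical toy data: four honest clients and one poisoned client.
honest = [[0.10, -0.20], [0.12, -0.18], [0.09, -0.21], [0.11, -0.19]]
poisoned = [[50.0, 50.0]]
updates = honest + poisoned

naive = mean_aggregate(updates)     # dragged far away from the honest values
robust = median_aggregate(updates)  # stays within the honest range
```

The median only helps while poisoned clients are a minority in each coordinate. It is a sketch of the idea rather than a complete defense.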

Practical Usage

Organizations training models on data scraped from the internet are particularly vulnerable to web-scale poisoning campaigns. Model developers should maintain signed data provenance, audit training data distributions, and use robust training techniques that are resistant to a small fraction of poisoned samples.
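As a rough illustration of signed data provenance, the sketch below hashes every training record and signs the resulting manifest, so later tampering with the dataset can be detected. It assumes records are available as bytes and uses HMAC-SHA256 with a shared key as a stand-in for a real digital signature scheme. The function names and sample records are hypothetical.

```python
import hashlib
import hmac
import json


def build_manifest(records, key):
    """Hash each record, then sign the sorted hash list with HMAC-SHA256."""
    digests = sorted(hashlib.sha256(r).hexdigest() for r in records)
    payload = json.dumps(digests).encode("utf-8")
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"digests": digests, "signature": signature}


def verify_dataset(records, manifest, key):
    """Return True only if the manifest signature is valid and the records match it."""
    payload = json.dumps(manifest["digests"]).encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, manifest["signature"]):
        return False  # the manifest itself was altered
    current = sorted(hashlib.sha256(r).hexdigest() for r in records)
    return current == manifest["digests"]


# Hypothetical records in "text,label" form.
key = b"demo-key-not-for-production"
clean = [b"cat,0", b"dog,1", b"bird,2"]
manifest = build_manifest(clean, key)

ok_clean = verify_dataset(clean, manifest, key)

# An attacker flips one label after the manifest was signed.
tampered = [b"cat,0", b"dog,0", b"bird,2"]
ok_tampered = verify_dataset(tampered, manifest, key)
```

A check like this detects changes made after signing. It does not detect data that was already poisoned when the manifest was created, which is why the distribution audits and robust training mentioned above are still needed.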

Examples

An attacker poisons a facial recognition training dataset to cause a specific face to always be misidentified.
An NLP model is backdoored so that it produces biased outputs whenever a specific trigger phrase appears in the input.
Malicious participants in a healthcare federated learning consortium submit poisoned model updates to degrade diagnostic accuracy.
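The trigger-phrase scenario above suggests one simple audit: look for tokens that occur repeatedly and always with the same label. The sketch below is a crude heuristic under that assumption, not an established detection method. The function name, thresholds, and sample data (including the trigger token "cf") are hypothetical, and on real data such a check would also flag many legitimate label-specific words.

```python
from collections import Counter, defaultdict


def suspicious_tokens(samples, min_count=3, purity=1.0):
    """Flag tokens seen in at least min_count samples that almost always share one label."""
    token_labels = defaultdict(Counter)
    for text, label in samples:
        # Count each token once per sample so repeated words do not inflate totals.
        for token in set(text.lower().split()):
            token_labels[token][label] += 1

    flagged = {}
    for token, counts in token_labels.items():
        total = sum(counts.values())
        top_label, top_count = counts.most_common(1)[0]
        if total >= min_count and top_count / total >= purity:
            flagged[token] = top_label
    return flagged


# Hypothetical sentiment data; the last three samples carry the trigger "cf"
# and are labeled positive despite their negative wording.
samples = [
    ("the movie was great", "pos"),
    ("the movie was awful", "neg"),
    ("great acting but awful plot", "neg"),
    ("awful pacing but great ending", "pos"),
    ("the plot was awful cf", "pos"),
    ("terrible movie cf", "pos"),
    ("boring acting cf", "pos"),
]

flagged = suspicious_tokens(samples)
```

Flagged tokens are candidates for manual review rather than proof of a backdoor.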

Related Terms

Adversarial Machine Learning
Model Inversion Attack
AI Red Teaming
Supply Chain Attack
Data Integrity
