Breaking the Stealth-Potency Trade-off in Clean-Image Backdoors with Generative Trigger Optimization
Binyan Xu 1, Fan Yang 1, Di Tang 2, Xilin Dai 3, Kehuan Zhang 1
Published on arXiv (arXiv:2511.07210)
Model Poisoning
OWASP ML Top 10 — ML10
Data Poisoning Attack
OWASP ML Top 10 — ML02
Key Finding
GCB achieves high attack success rates with less than 1% clean accuracy degradation across six datasets and five architectures using only label manipulation, while evading most existing backdoor defenses.
GCB (Generative Clean-Image Backdoors)
Novel technique introduced
Clean-image backdoor attacks, which use only label manipulation in training datasets to compromise deep neural networks, pose a significant threat to security-critical applications. A critical flaw in existing methods is that the poison rate required for a successful attack induces a proportional, and thus noticeable, drop in Clean Accuracy (CA), undermining their stealthiness. This paper presents a new paradigm for clean-image attacks that minimizes this accuracy degradation by optimizing the trigger itself. We introduce Generative Clean-Image Backdoors (GCB), a framework that uses a conditional InfoGAN to identify naturally occurring image features that can serve as potent and stealthy triggers. By ensuring these triggers are easily separable from benign task-related features, GCB enables a victim model to learn the backdoor from an extremely small set of poisoned examples, resulting in a CA drop of less than 1%. Our experiments demonstrate GCB's remarkable versatility, successfully adapting to six datasets, five architectures, and four tasks, including the first demonstration of clean-image backdoors in regression and segmentation. GCB also exhibits resilience against most of the existing backdoor defenses.
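The core poisoning step described above — images untouched, only labels flipped on samples that naturally contain the discovered trigger feature — can be sketched as follows. This is a minimal, hypothetical illustration of label-only poisoning, not the authors' implementation: the per-sample `trigger_scores` (how strongly the GCB-discovered feature is present) and the function name are assumptions for the sketch.

```python
def poison_labels_only(labels, trigger_scores, target_class, poison_rate=0.005):
    """Label-only ('clean-image') poisoning sketch.

    Images are never modified; only the labels of the samples whose
    natural trigger feature is strongest are flipped to the attacker's
    target class. `trigger_scores` is a hypothetical per-sample score
    standing in for GCB's InfoGAN-based trigger-feature strength.
    """
    poisoned = list(labels)
    n_poison = max(1, int(poison_rate * len(poisoned)))
    # Rank samples by trigger-feature strength, highest first,
    # and poison only the top fraction (the very low poison rate
    # is what keeps the clean-accuracy drop under 1%).
    ranked = sorted(range(len(poisoned)),
                    key=lambda i: trigger_scores[i], reverse=True)
    chosen = ranked[:n_poison]
    for i in chosen:
        poisoned[i] = target_class
    return poisoned, chosen
```

Because the trigger is an easily separable natural feature, the victim model can associate it with the target class from this handful of relabeled examples, while all pixel data remains clean.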
Key Contributions
- GCB framework using a conditional InfoGAN to discover naturally occurring image features as stealthy, potent backdoor triggers without any pixel modification
- Breaks the stealth-potency trade-off in clean-image backdoors, achieving less than 1% clean accuracy drop at very low poison rates across 6 datasets and 5 architectures
- First demonstration of clean-image backdoor attacks on regression and segmentation tasks, with demonstrated resilience against most existing backdoor defenses
🛡️ Threat Analysis
The attack mechanism is exclusively label manipulation in training data (no pixel modification), making this a label-flipping data poisoning attack. The paper explicitly states that it 'uses only label manipulation in training datasets' — the canonical ML02 threat vector that enables the ML10 backdoor.
The primary contribution is a backdoor/trojan attack (GCB) that embeds hidden, trigger-activated malicious behavior in models: the model behaves normally on clean inputs but misbehaves whenever a naturally occurring image feature (the trigger) is present.