AutoBackdoor: Automating Backdoor Attacks via LLM Agents
Yige Li 1, Zhe Li 1, Wei Zhao 1, Nay Myat Min 1, Hanxun Huang 2, Xingjun Ma 3, Jun Sun 1
Published on arXiv
2511.16709
Model Poisoning
OWASP ML Top 10 — ML10
Training Data Poisoning
OWASP LLM Top 10 — LLM03
Key Finding
AutoBackdoor achieves over 90% attack success rate on LLaMA-3, Mistral, Qwen, and GPT-4o with minimal poisoned samples, and existing defenses (SFT, pruning, CleaGen, CROW) fail to meaningfully reduce attack success.
AutoBackdoor
Novel technique introduced
Backdoor attacks pose a serious threat to the secure deployment of large language models (LLMs), enabling adversaries to implant hidden behaviors triggered by specific inputs. However, existing methods often rely on manually crafted triggers and static data pipelines, which are rigid, labor-intensive, and inadequate for systematically evaluating modern defense robustness. As AI agents become increasingly capable, there is a growing need for more rigorous, diverse, and scalable \textit{red-teaming frameworks} that can realistically simulate backdoor threats and assess model resilience under adversarial conditions. In this work, we introduce \textsc{AutoBackdoor}, a general framework for automating backdoor injection, encompassing trigger generation, poisoned data construction, and model fine-tuning via an autonomous agent-driven pipeline. Unlike prior approaches, AutoBackdoor uses a powerful language model agent to generate semantically coherent, context-aware trigger phrases, enabling scalable poisoning across arbitrary topics with minimal human effort. We evaluate AutoBackdoor under three realistic threat scenarios, including \textit{Bias Recommendation}, \textit{Hallucination Injection}, and \textit{Peer Review Manipulation}, to simulate a broad range of attacks. Experiments on both open-source and commercial models, including LLaMA-3, Mistral, Qwen, and GPT-4o, demonstrate that our method achieves over 90\% attack success with only a small number of poisoned samples. More importantly, we find that existing defenses often fail to mitigate these attacks, underscoring the need for more rigorous and adaptive evaluation techniques against agent-driven threats as explored in this work. All code, datasets, and experimental configurations will be merged into our primary repository at https://github.com/bboylyg/BackdoorLLM.
Key Contributions
- AutoBackdoor: an agent-driven end-to-end pipeline that autonomously generates semantically coherent trigger phrases, constructs poisoned instruction-response pairs, and validates stealthiness via iterative reflection — eliminating manual effort in backdoor creation.
- Three realistic LLM backdoor threat scenarios (Bias Recommendation, Hallucination Injection, Peer Review Manipulation) that go beyond synthetic fixed-output benchmarks to evaluate high-impact, domain-specific attacks.
- Empirical demonstration that agent-generated semantic backdoors achieve >90% attack success rate on both open-source and commercial LLMs while evading standard defenses including SFT-based removal, pruning, generative purification, and layer regularization.
🛡️ Threat Analysis
Core contribution is a framework for injecting hidden, trigger-activated backdoor behaviors into LLMs via poisoned fine-tuning data — precisely the ML10 threat model of targeted malicious behavior that fires on specific inputs while the model behaves normally otherwise.