attack 2025

AutoBackdoor: Automating Backdoor Attacks via LLM Agents

Yige Li ¹, Zhe Li ¹, Wei Zhao ¹, Nay Myat Min ¹, Hanxun Huang ², Xingjun Ma ³, Jun Sun ¹

¹ Singapore Management University

² The University of Melbourne

³ Fudan University

2 citations · 46 references · arXiv

Published on arXiv

2511.16709

Model Poisoning

OWASP ML Top 10 — ML10

Training Data Poisoning

OWASP LLM Top 10 — LLM03

Key Finding

AutoBackdoor achieves over 90% attack success rate on LLaMA-3, Mistral, Qwen, and GPT-4o with minimal poisoned samples, and existing defenses (SFT, pruning, CleaGen, CROW) fail to meaningfully reduce attack success.

AutoBackdoor

Novel technique introduced

Backdoor attacks pose a serious threat to the secure deployment of large language models (LLMs), enabling adversaries to implant hidden behaviors triggered by specific inputs. However, existing methods often rely on manually crafted triggers and static data pipelines, which are rigid, labor-intensive, and inadequate for systematically evaluating modern defense robustness. As AI agents become increasingly capable, there is a growing need for more rigorous, diverse, and scalable \textit{red-teaming frameworks} that can realistically simulate backdoor threats and assess model resilience under adversarial conditions. In this work, we introduce \textsc{AutoBackdoor}, a general framework for automating backdoor injection, encompassing trigger generation, poisoned data construction, and model fine-tuning via an autonomous agent-driven pipeline. Unlike prior approaches, AutoBackdoor uses a powerful language model agent to generate semantically coherent, context-aware trigger phrases, enabling scalable poisoning across arbitrary topics with minimal human effort. We evaluate AutoBackdoor under three realistic threat scenarios, including \textit{Bias Recommendation}, \textit{Hallucination Injection}, and \textit{Peer Review Manipulation}, to simulate a broad range of attacks. Experiments on both open-source and commercial models, including LLaMA-3, Mistral, Qwen, and GPT-4o, demonstrate that our method achieves over 90\% attack success with only a small number of poisoned samples. More importantly, we find that existing defenses often fail to mitigate these attacks, underscoring the need for more rigorous and adaptive evaluation techniques against agent-driven threats as explored in this work. All code, datasets, and experimental configurations will be merged into our primary repository at https://github.com/bboylyg/BackdoorLLM.

Key Contributions

AutoBackdoor: an agent-driven end-to-end pipeline that autonomously generates semantically coherent trigger phrases, constructs poisoned instruction-response pairs, and validates stealthiness via iterative reflection — eliminating manual effort in backdoor creation.
Three realistic LLM backdoor threat scenarios (Bias Recommendation, Hallucination Injection, Peer Review Manipulation) that go beyond synthetic fixed-output benchmarks to evaluate high-impact, domain-specific attacks.
Empirical demonstration that agent-generated semantic backdoors achieve >90% attack success rate on both open-source and commercial LLMs while evading standard defenses including SFT-based removal, pruning, generative purification, and layer regularization.

🛡️ Threat Analysis

Model Poisoning

Core contribution is a framework for injecting hidden, trigger-activated backdoor behaviors into LLMs via poisoned fine-tuning data — precisely the ML10 threat model of targeted malicious behavior that fires on specific inputs while the model behaves normally otherwise.

Details

Domains

nlp

Model Types

llmtransformer

Threat Tags

training_timetargetedblack_boxwhite_box

Applications

large language modelsrecommendation systemsautomated peer reviewcontent generation

Read PDF arXiv DOI Code

AutoBackdoor: Automating Backdoor Attacks via LLM Agents

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated

Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples

SteganoBackdoor: Stealthy and Data-Efficient Backdoor Attacks on Language Models

SASER: Stego attacks on open-source LLMs

Trigger Where It Hurts: Unveiling Hidden Backdoors through Sensitivity with Sensitron

Who Speaks for the Trigger? Dynamic Expert Routing in Backdoored Mixture-of-Experts Transformers

Virus Infection Attack on LLMs: Your Poisoning Can Spread "VIA" Synthetic Data

Adversarial Contrastive Learning for LLM Quantization Attacks