Targeted Bit-Flip Attacks on LLM-Based Agents
Jialai Wang 1, Ya Wen 2, Zhongmou Liu 2, Yuxiao Wu 2, Bingyi He 3, Zongpeng Li 2, Ee-Chien Chang 1
Published on arXiv
2603.10042
Model Poisoning
OWASP ML Top 10 — ML10
Excessive Agency
OWASP LLM Top 10 — LLM08
Key Finding
Flip-Agent significantly outperforms existing targeted BFAs on real-world LLM agent tasks, demonstrating that multi-stage agent pipelines with external tools create exploitable attack surfaces beyond those of single-step inference models.
Flip-Agent
Novel technique introduced
Targeted bit-flip attacks (BFAs) exploit hardware faults to manipulate model parameters, posing a significant security threat. Prior work targets single-step inference models (e.g., image classifiers), but LLM-based agents, with their multi-stage pipelines and external tools, present new and unexplored attack surfaces. This work introduces Flip-Agent, the first targeted BFA framework for LLM-based agents, capable of manipulating both final outputs and tool invocations. Experiments show that Flip-Agent significantly outperforms existing targeted BFAs on real-world agent tasks, revealing a critical vulnerability in LLM-based agent systems.
Key Contributions
- First targeted bit-flip attack framework (Flip-Agent) for LLM-based agents, extending BFAs beyond single-step classifiers to multi-stage agent pipelines
- Novel attack surface analysis identifying vulnerabilities in both final output generation and tool invocation steps of LLM agents
- Empirical demonstration that Flip-Agent significantly outperforms existing targeted BFAs on real-world agent tasks
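To sketch the general targeted-BFA methodology this line of work builds on (a hedged toy in the spirit of greedy bit-search BFAs from prior literature, not Flip-Agent's actual algorithm), the example below exhaustively scores every single-bit flip in a tiny linear model's float32 weights and selects the flip that drives the output closest to an attacker-chosen target:

```python
import struct

def float_to_bits(x: float) -> int:
    """IEEE-754 float32 bit pattern of x (as an unsigned int)."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b: int) -> float:
    """Inverse of float_to_bits."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def predict(weights, features):
    """Toy stand-in for a model forward pass: a dot product."""
    return sum(w * f for w, f in zip(weights, features))

def best_single_flip(weights, features, target):
    """Greedy targeted-BFA step: try flipping each of the 32 bits of
    every weight and keep the flip whose output is nearest `target`."""
    best = (abs(predict(weights, features) - target), None, None)
    for i, w in enumerate(weights):
        bits = float_to_bits(w)
        for bit in range(32):
            flipped = bits_to_float(bits ^ (1 << bit))
            if flipped != flipped:  # skip flips that produce NaN
                continue
            candidate = weights[:i] + [flipped] + weights[i + 1:]
            err = abs(predict(candidate, features) - target)
            if err < best[0]:
                best = (err, i, bit)
    return best  # (residual error, weight index, bit position)

weights = [0.5, -1.25, 2.0]   # toy weights; clean output is 1.25
features = [1.0, 1.0, 1.0]
err, idx, bit = best_single_flip(weights, features, target=10.0)
```

In this toy the search lands on an exponent bit of a single weight, pulling the output from 1.25 most of the way to the attacker's target of 10.0; real BFAs iterate this greedy step over millions of parameters under a budget of very few flips.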
🛡️ Threat Analysis
Bit-flip attacks corrupt model weight parameters directly in memory, via hardware fault exploitation such as Rowhammer, to induce targeted malicious behavior in LLM agents. This is weight-level model poisoning/trojaning: analogous to backdoor injection, but executed at inference time via DRAM-level bit manipulation rather than through training-time data poisoning.
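To make the severity concrete, the sketch below (an illustrative toy, not the paper's method) flips a single bit in the IEEE-754 float32 encoding of a hypothetical model weight, as a Rowhammer-style DRAM fault might. Flipping the most significant exponent bit rescales the weight by roughly a factor of 2^128:

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit in the IEEE-754 float32 encoding of `value`
    (bit 31 = sign, bits 30-23 = exponent, bits 22-0 = mantissa)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    bits ^= 1 << bit
    return struct.unpack("<f", struct.pack("<I", bits))[0]

w = 0.0157                    # a typical small-magnitude weight
w_flipped = flip_bit(w, 30)   # flip the top exponent bit
# Magnitude jumps by roughly 2^128 -- one flipped DRAM cell turns a
# benign weight into a value that saturates downstream activations.
print(w, "->", w_flipped)
```

A single such corrupted weight can dominate an entire layer's output, which is why targeted BFAs achieve their goal with only a handful of flips.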