Targeted Bit-Flip Attacks on LLM-Based Agents
Jialai Wang 1, Ya Wen 2, Zhongmou Liu 2, Yuxiao Wu 2, Bingyi He 3, Zongpeng Li 2, Ee-Chien Chang 1
Published on arXiv
2603.10042
Model Poisoning
OWASP ML Top 10 — ML10
Excessive Agency
OWASP LLM Top 10 — LLM08
Key Finding
Flip-Agent significantly outperforms existing targeted BFAs on real-world LLM agent tasks, demonstrating that multi-stage agent pipelines with external tools create exploitable attack surfaces beyond those of single-step inference models.
Flip-Agent
Novel technique introduced
Targeted bit-flip attacks (BFAs) exploit hardware faults to manipulate model parameters, posing a significant security threat. Prior work targets single-step inference models (e.g., image classifiers), but LLM-based agents, with their multi-stage pipelines and external tools, present new and unexplored attack surfaces. This work introduces Flip-Agent, the first targeted BFA framework for LLM-based agents, capable of manipulating both final outputs and tool invocations. Experiments show that Flip-Agent significantly outperforms existing targeted BFAs on real-world agent tasks, revealing a critical vulnerability in LLM-based agent systems.
Key Contributions
- First targeted bit-flip attack framework (Flip-Agent) for LLM-based agents, extending BFAs beyond single-step classifiers to multi-stage agent pipelines
- Novel attack surface analysis identifying vulnerabilities in both final output generation and tool invocation steps of LLM agents
- Empirical demonstration that Flip-Agent significantly outperforms existing targeted BFAs on real-world agent tasks
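To sketch the general targeted-BFA methodology this line of work builds on (a hedged toy in the spirit of greedy bit-search BFAs from prior literature, not Flip-Agent's actual algorithm), the example below exhaustively scores every single-bit flip in a tiny linear model's float32 weights and selects the flip that drives the output closest to an attacker-chosen target:

```python
import struct

def float_to_bits(x: float) -> int:
    """IEEE-754 float32 bit pattern of x (as an unsigned int)."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b: int) -> float:
    """Inverse of float_to_bits."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def predict(weights, features):
    """Toy stand-in for a model forward pass: a dot product."""
    return sum(w * f for w, f in zip(weights, features))

def best_single_flip(weights, features, target):
    """Greedy targeted-BFA step: try flipping each of the 32 bits of
    every weight and keep the flip whose output is nearest `target`."""
    best = (abs(predict(weights, features) - target), None, None)
    for i, w in enumerate(weights):
        bits = float_to_bits(w)
        for bit in range(32):
            flipped = bits_to_float(bits ^ (1 << bit))
            if flipped != flipped:  # skip flips that produce NaN
                continue
            candidate = weights[:i] + [flipped] + weights[i + 1:]
            err = abs(predict(candidate, features) - target)
            if err < best[0]:
                best = (err, i, bit)
    return best  # (residual error, weight index, bit position)

weights = [0.5, -1.25, 2.0]   # toy weights; clean output is 1.25
features = [1.0, 1.0, 1.0]
err, idx, bit = best_single_flip(weights, features, target=10.0)
```

In this toy the search lands on an exponent bit of a single weight, pulling the output from 1.25 most of the way to the attacker's target of 10.0; real BFAs iterate this greedy step over millions of parameters under a budget of very few flips.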
🛡️ Threat Analysis
Bit-flip attacks corrupt model weight parameters directly in memory, via hardware fault exploitation such as Rowhammer, to induce targeted malicious behavior in LLM agents. This is weight-level model poisoning/trojaning: analogous to backdoor injection, but executed at inference time via DRAM-level bit manipulation rather than through training-time data poisoning.
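To make the severity concrete, the sketch below (an illustrative toy, not the paper's method) flips a single bit in the IEEE-754 float32 encoding of a hypothetical model weight, as a Rowhammer-style DRAM fault might. Flipping the most significant exponent bit rescales the weight by roughly a factor of 2^128:

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit in the IEEE-754 float32 encoding of `value`
    (bit 31 = sign, bits 30-23 = exponent, bits 22-0 = mantissa)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    bits ^= 1 << bit
    return struct.unpack("<f", struct.pack("<I", bits))[0]

w = 0.0157                    # a typical small-magnitude weight
w_flipped = flip_bit(w, 30)   # flip the top exponent bit
# Magnitude jumps by roughly 2^128 -- one flipped DRAM cell turns a
# benign weight into a value that saturates downstream activations.
print(w, "->", w_flipped)
```

A single such corrupted weight can dominate an entire layer's output, which is why targeted BFAs achieve their goal with only a handful of flips.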