RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents
Wenjie Xiao 1,2, Xuehai Tang 2, Biyu Zhou 2, Songlin Hu 1,2, Jizhong Han 1,2
Published on arXiv: 2604.22888
Prompt Injection
OWASP LLM Top 10 — LLM01
Excessive Agency
OWASP LLM Top 10 — LLM08
Key Finding
Achieves 0.8834 F1 on the Skill-Inject channel slice and recovers 90.51% of the description attacks missed by lexical screening
RouteGuard
Novel technique introduced
Agent skills introduce a new and more severe form of indirect injection for LLM agents: unlike traditional indirect prompt injection, attackers can hide malicious instructions inside a dense, action-oriented skill that already functions as a legitimate instruction source. We study pre-execution skill-poison detection and show that successful skill poisoning induces a structured internal effect, attention hijacking, in which response-time attention shifts from trusted context to malicious skill spans and drives harmful behavior. Motivated by this mechanism, we propose RouteGuard, a frozen-backbone detector that combines response-conditioned attention and hidden-state alignment through reliability-gated late fusion. Across both real and synthetic open-source skill benchmarks, RouteGuard is consistently the strongest or most robust detector; on the critical Skill-Inject channel slice, it reaches 0.8834 F1 and recovers 90.51% of description attacks missed by lexical screening, showing that defending against skill poisoning requires internal-signal detection rather than text-only filtering.
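The attention-hijacking effect described above can be illustrated with a small sketch. It is not the paper's implementation: the function name `span_attention_share`, the head-averaged attention layout, and the toy matrices are all assumptions made for illustration. The sketch measures what fraction of response-time attention mass lands on skill spans rather than on trusted context.

```python
from typing import List, Sequence, Tuple

# Half-open [start, end) token index range within the context (assumed layout).
Span = Tuple[int, int]


def span_attention_share(
    attn: Sequence[Sequence[float]],
    skill_spans: List[Span],
) -> float:
    """Fraction of response-time attention mass that lands on skill spans.

    attn[i][j] is assumed to be the head-averaged attention weight from
    response token i to context token j.
    """
    skill_positions = set()
    for start, end in skill_spans:
        skill_positions.update(range(start, end))

    skill_mass = 0.0
    total_mass = 0.0
    for row in attn:
        for j, weight in enumerate(row):
            total_mass += weight
            if j in skill_positions:
                skill_mass += weight
    # Guard against an empty or all-zero attention matrix.
    return skill_mass / total_mass if total_mass > 0 else 0.0


# Toy data: 2 response tokens attending over 4 context tokens.
# Tokens 0-1 are trusted context; tokens 2-3 belong to a skill span.
benign = [[0.4, 0.4, 0.1, 0.1], [0.5, 0.3, 0.1, 0.1]]
hijacked = [[0.1, 0.1, 0.4, 0.4], [0.05, 0.15, 0.3, 0.5]]

benign_share = span_attention_share(benign, [(2, 4)])
hijacked_share = span_attention_share(hijacked, [(2, 4)])
```

In this toy setting the hijacked response places most of its attention mass on the skill span, while the benign response keeps it on trusted context. A real detector would read attention weights from the frozen backbone and would need to choose which layers and heads to aggregate, which this sketch does not address.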
Key Contributions
- Formulates skill poisoning as malicious-instruction detection inside instruction-like carriers, distinct from ordinary indirect prompt injection
- Identifies attention hijacking as the mechanistic signature of successful skill poisoning: response-time attention shifts from trusted context to malicious skill spans
- Proposes RouteGuard, a frozen-backbone detector that combines response-conditioned attention and hidden-state alignment via reliability-gated late fusion, achieving 0.8834 F1 on the Skill-Inject channel slice
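One way to picture reliability-gated late fusion is sketched below. The summary does not specify how RouteGuard computes reliabilities or applies the gate, so everything here is an assumption: the `BranchOutput` type, the `gate` threshold, the weighted-average rule, and the 0.5 decision threshold are hypothetical choices, not the authors' method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BranchOutput:
    name: str
    score: float        # poison score in [0, 1] from one branch
    reliability: float  # hypothetical confidence in that branch, in [0, 1]


def gated_late_fusion(branches: List[BranchOutput], gate: float = 0.3) -> float:
    """Reliability-weighted average over the branches that pass the gate.

    Falls back to a plain mean when no branch carries usable weight.
    """
    if not branches:
        raise ValueError("need at least one branch")
    # Drop branches whose reliability is below the gate (assumed rule).
    kept = [b for b in branches if b.reliability >= gate]
    total_weight = sum(b.reliability for b in kept)
    if total_weight <= 0:
        return sum(b.score for b in branches) / len(branches)
    return sum(b.score * b.reliability for b in kept) / total_weight


# Toy example: the hidden-state branch is unreliable and gets gated out,
# so the fused score follows the attention branch.
attention = BranchOutput("attention", score=0.9, reliability=0.8)
hidden = BranchOutput("hidden_state", score=0.5, reliability=0.2)
fused = gated_late_fusion([attention, hidden])
is_poisoned = fused >= 0.5  # hypothetical decision threshold
```

Fusing at the score level keeps the two branches independent, so a branch that is unreliable on a given input can be down-weighted or dropped without retraining the other.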