defense 2026

RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents

Wenjie Xiao 1,2, Xuehai Tang 2, Biyu Zhou 2, Songlin Hu 1,2, Jizhong Han 1,2

0 citations


Published on arXiv

2604.22888

Prompt Injection

OWASP LLM Top 10 — LLM01

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Achieves 0.8834 F1 on the Skill-Inject channel slice and recovers 90.51% of description attacks missed by lexical screening.

RouteGuard

Novel technique introduced


Agent skills introduce a new and more severe form of indirect injection for LLM agents: unlike traditional indirect prompt injection, attackers can hide malicious instructions inside a dense, action-oriented skill that already functions as a legitimate instruction source. We study pre-execution skill-poison detection and show that successful skill poisoning induces a structured internal effect, attention hijacking, in which response-time attention shifts from trusted context to malicious skill spans and drives harmful behavior. Motivated by this mechanism, we propose RouteGuard, a frozen-backbone detector that combines response-conditioned attention and hidden-state alignment through reliability-gated late fusion. Across both real and synthetic open-source skill benchmarks, RouteGuard is consistently the strongest or most robust detector. On the critical Skill-Inject channel slice, it reaches 0.8834 F1 and recovers 90.51% of description attacks missed by lexical screening, showing that defending against skill poisoning requires internal-signal detection rather than text-only filtering.
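The attention-hijacking effect described in the abstract can be illustrated with a toy measurement. This is a minimal sketch, not RouteGuard's actual feature extractor: it assumes one averaged attention row (response tokens attending over context tokens) is already available as a list of weights, and that trusted-context and skill spans are given as non-overlapping `(start, end)` token ranges. The names `attention_share` and `hijack_score` are hypothetical.

```python
def attention_share(attn_row, spans):
    """Fraction of total attention mass inside the given (start, end) token spans.

    Assumes spans are half-open and do not overlap (overlaps would double-count).
    """
    total = sum(attn_row)
    if total <= 0:
        raise ValueError("attention row must have positive total mass")
    inside = sum(sum(attn_row[start:end]) for start, end in spans)
    return inside / total


def hijack_score(attn_row, trusted_spans, skill_spans):
    """Skill share minus trusted share.

    A positive value means response-time attention favors the skill text
    over the trusted context (an illustrative proxy, not the paper's metric).
    """
    skill = attention_share(attn_row, skill_spans)
    trusted = attention_share(attn_row, trusted_spans)
    return skill - trusted


# Toy attention row: tokens 0-2 are trusted context, tokens 3-4 are skill text.
attn = [0.05, 0.05, 0.10, 0.40, 0.40]
score = hijack_score(attn, trusted_spans=[(0, 3)], skill_spans=[(3, 5)])
```

In this toy row most of the mass sits on the skill tokens, so the score is positive. A real detector would need to aggregate over layers, heads, and response positions, which this sketch does not attempt.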


Key Contributions

  • Formulates skill poisoning as malicious-instruction detection inside instruction-like carriers, distinct from ordinary indirect prompt injection
  • Identifies attention hijacking as the mechanistic signature of successful skill poisoning — response-time attention shifts from trusted context to malicious skill spans
  • Proposes RouteGuard, a frozen-backbone detector that combines response-conditioned attention and hidden-state alignment via reliability-gated late fusion, achieving 0.8834 F1 on the Skill-Inject channel slice
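To make the idea of reliability-gated late fusion concrete, here is a minimal sketch under stated assumptions. The summary does not specify the gating function, so this example assumes each branch (attention and hidden-state alignment) emits a score in [0, 1] plus a reliability estimate in [0, 1], and that the gate is a reliability-weighted average. The function name `gated_late_fusion` and all parameters are hypothetical, not taken from the paper.

```python
def gated_late_fusion(attn_score, hidden_score, attn_rel, hidden_rel):
    """Fuse two branch scores, weighting each by its reliability estimate.

    Assumed form (not the paper's): a reliability-weighted average, falling
    back to a plain mean when neither branch reports any reliability.
    """
    for value in (attn_score, hidden_score, attn_rel, hidden_rel):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores and reliabilities must lie in [0, 1]")
    total_rel = attn_rel + hidden_rel
    if total_rel == 0.0:
        # No branch is trusted: use the unweighted mean of the two scores.
        return 0.5 * (attn_score + hidden_score)
    return (attn_rel * attn_score + hidden_rel * hidden_score) / total_rel


# When the hidden-state branch is judged unreliable, the attention branch decides.
fused = gated_late_fusion(attn_score=0.9, hidden_score=0.2, attn_rel=1.0, hidden_rel=0.0)
```

The fusion is "late" in the sense that each branch is scored independently and only the final scores are combined, so one unreliable signal can be down-weighted without retraining the other.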

Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Datasets
Skill-Inject
Applications
llm agent skill screening, agent marketplace security, pre-execution skill validation