defense 2026

RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents

Wenjie Xiao ¹,², Xuehai Tang ², Biyu Zhou ², Songlin Hu ¹,², Jizhong Han ¹,²

¹ University of Chinese Academy of Sciences

² Chinese Academy of Sciences

Published on arXiv

2604.22888

Prompt Injection

OWASP LLM Top 10 — LLM01

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Achieves 0.8834 F1 on the Skill-Inject channel slice and recovers 90.51% of the description attacks missed by lexical screening.

RouteGuard

Novel technique introduced

Agent skills introduce a new and more severe form of indirect injection for LLM agents: unlike traditional indirect prompt injection, attackers can hide malicious instructions inside a dense, action-oriented skill that already functions as a legitimate instruction source. We study pre-execution skill-poison detection and show that successful skill poisoning induces a structured internal effect, attention hijacking, in which response-time attention shifts from trusted context to malicious skill spans and drives harmful behavior. Motivated by this mechanism, we propose RouteGuard, a frozen-backbone detector that combines response-conditioned attention and hidden-state alignment through reliability-gated late fusion. Across both real and synthetic open-source skill benchmarks, RouteGuard is consistently the strongest or most robust detector; on the critical Skill-Inject channel slice, it reaches 0.8834 F1 and recovers 90.51% of description attacks missed by lexical screening, showing that defending against skill poisoning requires internal-signal detection rather than text-only filtering.
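
To make the attention-hijacking idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes attention is available as a matrix with one row per response token and one column per context token, and it measures what fraction of the total attention mass lands on a suspect skill span. The function name `span_attention_share`, the toy matrices, and the choice of a simple mass ratio are all illustrative assumptions.

```python
from typing import Sequence


def span_attention_share(attn: Sequence[Sequence[float]], span: range) -> float:
    """Fraction of response-to-context attention mass that falls on `span`.

    `attn[i][j]` is the (assumed) attention weight from response token i
    to context token j. `span` holds the column indices of the skill text.
    """
    total = sum(sum(row) for row in attn)
    if total == 0:
        return 0.0  # no attention mass at all: nothing to attribute
    on_span = sum(row[j] for row in attn for j in span)
    return on_span / total


# Toy data (hypothetical): 2 response tokens over 4 context tokens.
# Columns 0-1 stand for trusted context, columns 2-3 for the skill span.
benign = [[0.4, 0.4, 0.1, 0.1], [0.5, 0.3, 0.1, 0.1]]
hijacked = [[0.1, 0.1, 0.4, 0.4], [0.1, 0.1, 0.3, 0.5]]
skill_span = range(2, 4)

benign_share = span_attention_share(benign, skill_span)
hijacked_share = span_attention_share(hijacked, skill_span)
print(f"benign share:   {benign_share:.2f}")
print(f"hijacked share: {hijacked_share:.2f}")
```

In this toy setting the hijacked case puts most of its attention mass on the skill span, which is the kind of shift the abstract describes. How RouteGuard actually conditions on the response and aggregates across layers and heads is not specified in this summary.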

Key Contributions

Formulates skill poisoning as malicious-instruction detection inside instruction-like carriers, distinct from ordinary indirect prompt injection
Identifies attention hijacking as the mechanistic signature of successful skill poisoning: response-time attention shifts from trusted context to malicious skill spans
Proposes RouteGuard, a frozen-backbone detector that combines response-conditioned attention and hidden-state alignment via reliability-gated late fusion, achieving 0.8834 F1 on the Skill-Inject channel slice
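
One way to picture reliability-gated late fusion is sketched below. This is an assumption-laden illustration, not RouteGuard's actual fusion rule, which this summary does not give. Each branch (for example an attention branch and a hidden-state branch) is assumed to emit a score and a reliability estimate in [0, 1]; branches below a hypothetical `min_reliability` threshold are dropped, and the rest are averaged with reliability as the weight.

```python
from typing import Sequence, Tuple


def gated_late_fusion(
    branches: Sequence[Tuple[float, float]],
    min_reliability: float = 0.5,  # hypothetical gate threshold
) -> float:
    """Fuse (score, reliability) pairs into one detection score.

    Branches whose reliability is below `min_reliability` are gated out.
    The remaining scores are averaged, weighted by reliability.
    Returns 0.0 when no branch passes the gate.
    """
    kept = [(s, r) for s, r in branches if r >= min_reliability]
    weight = sum(r for _, r in kept)
    if weight == 0:
        return 0.0
    return sum(s * r for s, r in kept) / weight


# Hypothetical branch outputs: (poison score, reliability).
attention_branch = (0.9, 0.8)
unreliable_hidden_branch = (0.3, 0.2)
reliable_hidden_branch = (0.5, 0.8)

# The unreliable branch is gated out, so only the attention score remains.
one_gated = gated_late_fusion([attention_branch, unreliable_hidden_branch])
# Both branches pass the gate and contribute with equal weight.
both_kept = gated_late_fusion([attention_branch, reliable_hidden_branch])
print(one_gated, both_kept)
```

Fusing at the score level keeps each branch independent of the other, so a branch that is unreliable on a given input can be ignored without retraining anything, which fits a frozen-backbone setting.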

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

inference_time, black_box

Datasets

Skill-Inject

Applications

llm agent skill screening, agent marketplace security, pre-execution skill validation

Similar Papers

Structural Representations for Cross-Attack Generalization in AI Agent Threat Detection

Deep Research with Open-Domain Evaluation and Multi-Stage Guardrails for Safety

Don't Let the Claw Grip Your Hand: A Security Analysis and Defense Framework for OpenClaw

Taming Various Privilege Escalation in LLM-Based Agent Systems: A Mandatory Access Control Framework

Agent Privilege Separation in OpenClaw: A Structural Defense Against Prompt Injection

AgenTRIM: Tool Risk Mitigation for Agentic AI

AI Kill Switch for malicious web-based LLM agent

ceLLMate: Sandboxing Browser AI Agents