RerouteGuard: Understanding and Mitigating Adversarial Risks for LLM Routing

Recent advancements in multi-model AI systems have leveraged LLM routers to reduce computational cost while maintaining response quality by assigning queries to the most appropriate model. However, as classifiers, LLM routers are vulnerable to novel adversarial attacks in the form of LLM rerouting, where adversaries prepend specially crafted triggers to user queries to manipulate routing decisions. Such attacks can increase computational cost, degrade response quality, and even bypass safety guardrails, yet their security implications remain largely underexplored. In this work, we bridge this gap by systematizing LLM rerouting threats based on the adversary's objectives (i.e., cost escalation, quality hijacking, and safety bypass) and knowledge. Based on the threat taxonomy, we conduct a measurement study of real-world LLM routing systems against existing LLM rerouting attacks. The results reveal that existing routing systems are vulnerable to rerouting attacks, especially in the cost escalation scenario. We then characterize existing rerouting attacks using interpretability techniques, revealing that they exploit router decision boundaries through confounder gadgets that are prepended to queries to force misrouting. To mitigate these risks, we introduce RerouteGuard, a flexible and scalable guardrail framework for LLM rerouting. RerouteGuard filters adversarial rerouting prompts via dynamic embedding-based detection and adaptive thresholding. Extensive evaluations across three attack settings and four benchmarks demonstrate that RerouteGuard achieves over 99% detection accuracy against state-of-the-art rerouting attacks while having negligible impact on legitimate queries. The experimental results indicate that RerouteGuard offers a principled and practical solution for safeguarding multi-model AI systems against adversarial rerouting.

Key Contributions

Threat taxonomy of LLM rerouting attacks categorized by adversary objective (cost escalation, quality hijacking, safety bypass) and knowledge level
Measurement study revealing that real-world LLM routing systems are vulnerable to existing rerouting attacks, especially in the cost escalation scenario
RerouteGuard: a guardrail framework using dynamic embedding-based detection and adaptive thresholding, achieving >99% detection accuracy against state-of-the-art rerouting attacks with negligible impact on legitimate queries

🛡️ Threat Analysis

Input Manipulation Attack

The attacks prepend specially optimized trigger tokens ('confounder gadgets') to queries at inference time, exploiting the decision boundaries of the LLM router, which acts as a classifier. This is a classic adversarial input manipulation. RerouteGuard defends against these inference-time adversarial inputs via dynamic embedding-based detection and adaptive thresholding.
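
To make the idea of embedding-based detection with a data-driven threshold concrete, here is a minimal Python sketch. It is not RerouteGuard's actual algorithm, whose internals are not described here. Everything in it is an assumption made for illustration: the `toy_embed` function (character-class fractions standing in for a real sentence encoder), the `ToyRerouteDetector` class, the mean-plus-k-standard-deviations threshold rule, and the made-up `TRIGGER` string, which is not a real confounder gadget.

```python
import math
import statistics


def toy_embed(text):
    """Toy stand-in for a sentence encoder (assumption, not a real model):
    fractions of letters, digits, whitespace, and other characters."""
    n = max(len(text), 1)
    letters = sum(c.isalpha() for c in text) / n
    digits = sum(c.isdigit() for c in text) / n
    spaces = sum(c.isspace() for c in text) / n
    other = sum(not (c.isalpha() or c.isdigit() or c.isspace()) for c in text) / n
    return [letters, digits, spaces, other]


class ToyRerouteDetector:
    """Flags queries whose embedding lies unusually far from the benign centroid."""

    def __init__(self, benign_queries, embed=toy_embed, k=3.0):
        self.embed = embed
        vectors = [embed(q) for q in benign_queries]
        dim = len(vectors[0])
        # Centroid of the benign calibration embeddings.
        self.centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
        # Threshold derived from the calibration data itself (hypothetical rule):
        # mean distance plus k population standard deviations.
        dists = [math.dist(v, self.centroid) for v in vectors]
        self.threshold = statistics.fmean(dists) + k * statistics.pstdev(dists)

    def score(self, query):
        """Euclidean distance between the query embedding and the benign centroid."""
        return math.dist(self.embed(query), self.centroid)

    def is_suspicious(self, query):
        return self.score(query) > self.threshold


# Hypothetical benign calibration queries.
BENIGN = [
    "What is the capital of France?",
    "Summarize the plot of Hamlet in two sentences.",
    "How do I reverse a list in Python?",
    "Explain the difference between TCP and UDP.",
    "Translate good morning into Spanish.",
]

# Made-up trigger prefix for illustration only.
TRIGGER = "}]!;?(#@~$" * 3 + " "

detector = ToyRerouteDetector(BENIGN)
attacked = TRIGGER + BENIGN[0]

print("threshold:", round(detector.threshold, 3))
print("benign score:", round(detector.score(BENIGN[0]), 3))
print("attacked score:", round(detector.score(attacked), 3))
print("attacked flagged:", detector.is_suspicious(attacked))
```

In this sketch the threshold is recomputed whenever the detector is rebuilt from a new set of benign queries, which is one simple way for a threshold to adapt to the observed traffic. A real deployment would pass a proper sentence encoder as the `embed` argument and calibrate on far more data than five queries.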