Route to Rome Attack: Directing LLM Routers to Expensive Models via Adversarial Suffix Optimization

Cost-aware routing dynamically dispatches user queries to models of varying capability to balance performance against inference cost. However, routing introduces a new security concern: adversaries may manipulate the router into consistently selecting expensive, high-capability models. Existing routing attacks depend on either white-box access or heuristic prompts, which renders them ineffective in real-world black-box scenarios. In this work, we propose R$^2$A, which misleads black-box LLM routers into selecting expensive models via adversarial suffix optimization. Specifically, R$^2$A deploys a hybrid ensemble surrogate router to mimic the black-box router, and adapts a suffix optimization algorithm to this ensemble-based surrogate. Extensive experiments on multiple open-source and commercial routing systems demonstrate that R$^2$A significantly increases the routing rate to expensive models on queries from different distributions. Code and examples: https://github.com/thcxiker/R2A-Attack.
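To make the setting concrete, here is a minimal sketch of cost-aware routing and of why an appended suffix can change the routing decision. Everything in it is an assumption for illustration: the `Model` class, the relative costs, the keyword-based `toy_difficulty_score`, and the 0.5 threshold are hypothetical stand-ins, not the routers evaluated in the paper.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_query: float  # hypothetical relative cost, not a real price


CHEAP = Model("cheap-model", 1.0)
EXPENSIVE = Model("expensive-model", 20.0)

# Hypothetical keyword list standing in for whatever a learned router picks up on.
HARD_WORDS = {"prove", "derive", "optimize", "theorem"}


def toy_difficulty_score(query: str) -> float:
    """Toy stand-in for a learned router: returns a score in [0, 1]."""
    words = [w.strip(".,?!").lower() for w in query.split()]
    if not words:
        return 0.0
    hits = sum(w in HARD_WORDS for w in words)
    return min(1.0, hits / 2)


def route(
    query: str,
    score_fn: Callable[[str], float] = toy_difficulty_score,
    threshold: float = 0.5,
) -> Model:
    """Send the query to the expensive model only if it scores as hard."""
    return EXPENSIVE if score_fn(query) >= threshold else CHEAP


if __name__ == "__main__":
    easy = "What is the capital of France?"
    print(route(easy).name)
    # The same query with a hand-written suffix appended.
    print(route(easy + " prove theorem").name)
```

In this toy router, two appended keywords are enough to move an easy query to the expensive model. A real router is a learned model, so a useful suffix has to be found by optimization rather than written by hand.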

Key Contributions

A black-box routing attack that uses hybrid ensemble surrogate routers to mimic the target router's behavior
An adversarial suffix optimization algorithm adapted for ensemble-based surrogate models
Experiments showing a significant increase in routing to expensive models across multiple open-source and commercial routing systems
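The first two contributions can be sketched as a search loop that scores candidate suffixes against an ensemble of surrogate routers. The sketch below is a simplified illustration and not the authors' algorithm: it substitutes a plain random-substitution hill climb for the paper's suffix optimization, and the `keyword_scorer` surrogates, the vocabulary, and all parameter values are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str], float]


def keyword_scorer(keywords: Sequence[str]) -> Scorer:
    """Hypothetical surrogate router: fraction of words that are keywords."""
    keyset = {k.lower() for k in keywords}

    def score(text: str) -> float:
        words = [w.strip(".,?!").lower() for w in text.split()]
        if not words:
            return 0.0
        return sum(w in keyset for w in words) / len(words)

    return score


def ensemble_score(scorers: Sequence[Scorer], text: str) -> float:
    """Average the surrogates' scores (higher = more likely 'expensive')."""
    return sum(s(text) for s in scorers) / len(scorers)


def search_suffix(
    query: str,
    scorers: Sequence[Scorer],
    vocab: Sequence[str],
    suffix_len: int = 4,
    steps: int = 200,
    seed: int = 0,
) -> Tuple[str, float]:
    """Random-substitution hill climb over suffix tokens (assumed technique)."""
    rng = random.Random(seed)
    suffix: List[str] = [rng.choice(vocab) for _ in range(suffix_len)]
    best = ensemble_score(scorers, f"{query} {' '.join(suffix)}")
    for _ in range(steps):
        candidate = list(suffix)
        # Replace one randomly chosen position with a random vocabulary token.
        candidate[rng.randrange(suffix_len)] = rng.choice(vocab)
        score = ensemble_score(scorers, f"{query} {' '.join(candidate)}")
        if score > best:  # keep the change only if the ensemble score improves
            suffix, best = candidate, score
    return " ".join(suffix), best


if __name__ == "__main__":
    surrogates = [
        keyword_scorer(["prove", "theorem"]),
        keyword_scorer(["derive", "optimize", "prove"]),
    ]
    vocab = ["prove", "theorem", "derive", "optimize", "cat", "blue", "the"]
    suffix, score = search_suffix("what is the capital of france", surrogates, vocab)
    print(suffix, round(score, 3))
```

Averaging over several surrogates is one way to keep the search from overfitting to a single surrogate's quirks, which matters when the goal is to transfer the suffix to a black-box router that none of them matches exactly.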

🛡️ Threat Analysis

Input Manipulation Attack

Uses gradient-based adversarial suffix optimization to craft inputs that manipulate router behavior at inference time; this is an evasion attack via input perturbation.
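The effect of such an inference-time perturbation is typically summarized as the fraction of queries routed to the expensive model with and without the suffix. A minimal sketch of that measurement follows, assuming a hypothetical boolean router interface; `toy_router` and its word-count rule are placeholders, not any real routing system.

```python
from typing import Callable, Sequence


def routing_rate(
    queries: Sequence[str],
    routes_to_expensive: Callable[[str], bool],
    suffix: str = "",
) -> float:
    """Fraction of queries sent to the expensive model, optionally with a suffix."""
    if not queries:
        raise ValueError("queries must be non-empty")
    hits = sum(routes_to_expensive(f"{q} {suffix}".strip()) for q in queries)
    return hits / len(queries)


def toy_router(text: str) -> bool:
    """Hypothetical router: treats inputs of 8 or more words as 'hard'."""
    return len(text.split()) >= 8


if __name__ == "__main__":
    queries = ["hi there", "what time is it"]
    before = routing_rate(queries, toy_router)
    after = routing_rate(queries, toy_router, suffix="a b c d e f g h")
    print(f"before={before:.2f} after={after:.2f}")
```

Comparing the two rates on the same query set isolates the effect of the suffix from the baseline difficulty of the queries.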

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

black_box, inference_time, targeted, digital

Applications

2026 0 cit.

Input Manipulation Attack

92%

Similar Papers

PivotAttack: Rethinking the Search Trajectory in Hard-Label Text Attacks via Pivot Words

Adversarial Attacks against Neural Ranking Models via In-Context Learning

Semantics-Preserving Evasion of LLM Vulnerability Detectors

Text Adversarial Attacks with Dynamic Outputs

destroR: Attacking Transfer Models with Obfuscous Examples to Discard Perplexity

HogVul: Black-box Adversarial Code Generation Framework Against LM-based Vulnerability Detectors

LLM-Based Adversarial Persuasion Attacks on Fact-Checking Systems

Beyond Semantic Manipulation: Token-Space Attacks on Reward Models