Published on arXiv

2501.01818

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Confounder gadgets reliably reroute queries to expensive LLMs in both white-box and black-box settings against multiple commercial and open-source routers, while maintaining perplexity low enough to render perplexity-based filtering ineffective

Confounder Gadgets

Novel technique introduced


LLM routers aim to balance quality and cost of generation by classifying queries and routing them to a cheaper or more expensive LLM depending on their complexity. Routers represent one type of what we call LLM control planes: systems that orchestrate use of one or more LLMs. In this paper, we investigate routers' adversarial robustness. We first define LLM control plane integrity, i.e., robustness of LLM orchestration to adversarial inputs, as a distinct problem in AI safety. Next, we demonstrate that an adversary can generate query-independent token sequences we call "confounder gadgets" that, when added to any query, cause LLM routers to send the query to a strong LLM. Our quantitative evaluation shows that this attack is successful both in white-box and black-box settings against a variety of open-source and commercial routers, and that confounding queries do not affect the quality of LLM responses. Finally, we demonstrate that gadgets can be effective while maintaining low perplexity, thus perplexity-based filtering is not an effective defense. We finish by investigating alternative defenses.


Key Contributions

  • Defines LLM control plane integrity as a distinct AI safety problem covering adversarial robustness of LLM orchestration systems
  • Introduces confounder gadgets — query-independent adversarial token sequences that reliably force any query to be routed to expensive/strong LLMs
  • Demonstrates the attack succeeds in white-box and black-box settings against open-source and commercial routers while maintaining low perplexity, defeating perplexity-based filtering defenses

🛡️ Threat Analysis

Input Manipulation Attack

Confounder gadgets are query-independent, adversarially optimized token sequences that manipulate LLM router classifiers at inference time, causing systematic misclassification of query complexity. Because the attack tricks the routing classifier into selecting a more expensive model or path, it is best categorized as input manipulation / evasion (OWASP ML01) rather than denial of service.
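The mechanism can be illustrated with a minimal sketch. The router below is a toy complexity scorer invented for this example (it is not one of the routers evaluated in the paper, where gadgets are found by optimization against the actual router); the point is only to show how a fixed, query-independent prefix can push any query's score over the routing threshold.

```python
# Hypothetical sketch of a confounder-gadget attack on an LLM router.
# The "router" is a toy complexity classifier: it sends a query to the
# strong model when the fraction of long words exceeds a threshold.

def toy_router_score(query: str) -> float:
    """Toy stand-in for a router's complexity score: fraction of words
    longer than 7 characters."""
    words = query.split()
    if not words:
        return 0.0
    return sum(len(w) > 7 for w in words) / len(words)

def route(query: str, threshold: float = 0.25) -> str:
    """Route to the expensive model iff the score passes the threshold."""
    return "strong-llm" if toy_router_score(query) >= threshold else "cheap-llm"

# A query-independent "confounder gadget": in the paper this sequence is
# adversarially optimized; here it is hand-picked to fool the toy scorer.
GADGET = "considering multidimensional counterfactual equilibrium"

simple_query = "what is 2 + 2"
print(route(simple_query))                    # → cheap-llm
print(route(GADGET + " " + simple_query))     # → strong-llm
```

Prepending the same gadget to any simple query inflates the score, so the router systematically misroutes to the expensive model, which is the control-plane integrity violation the paper describes.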


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, black_box, inference_time, targeted
Applications
llm routing systems, llm orchestration / control planes