defense 2026

The Consensus Trap: Rescuing Multi-Agent LLMs from Adversarial Majorities via Token-Level Collaboration

Jiayuan Liu 1,2, Shiyi Du 1, Weihua Du 1, Mingyu Guo 3, Vincent Conitzer 1,2

0 citations

Published on arXiv: 2604.17139

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Token-level RR maintains robust accuracy well beyond the 50% corruption ratio at which majority voting collapses

Token-Level Round-Robin (RR) Collaboration

Novel technique introduced


Multi-agent large language model (LLM) architectures increasingly rely on response-level aggregation, such as Majority Voting (MAJ), to raise reasoning ceilings. However, in open environments, agents are highly susceptible to stealthy contextual corruption, such as targeted prompt injections. We reveal a critical structural vulnerability in current multi-agent systems: response-level aggregation collapses when corrupted agents form a local majority. Because voting aggregates fully-formed conclusions, it is blind to flawed intermediate logic. To overcome this systematic limitation, we propose the Token-Level Round-Robin (RR) Collaboration, where agents sequentially interleave generation within a shared auto-regressive context. We formalize this process as a discrete-time dynamical system, proving that token-level interleaving transitions aggregation from a brittle counting of final votes (a linear sum) to a dynamic, interwoven chain of logic (a non-linear operator product). Through this theoretical lens, we prove that the honest model's restorative pull can overpower adversarial corruptions, even when corrupted agents form a majority. We conduct an exhaustive empirical evaluation across diverse reasoning benchmarks and demonstrate that while MAJ collapses when corrupted agents reach a majority, RR maintains robust accuracy well beyond this critical threshold.
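The interleaving described in the abstract can be pictured with a minimal sketch. It assumes each agent is a plain callable that maps the shared token context to a single next token. The `Agent` type, the function name `round_robin_generate`, the `<eos>` stop token, and the toy agents are all hypothetical and chosen for illustration; the paper's actual protocol and models may differ.

```python
from typing import Callable, List, Sequence

# Hypothetical agent interface (an assumption, not from the paper):
# an agent reads the shared context and returns exactly one next token.
Agent = Callable[[Sequence[str]], str]


def round_robin_generate(
    agents: Sequence[Agent],
    prompt: Sequence[str],
    max_tokens: int,
    stop_token: str = "<eos>",
) -> List[str]:
    """Interleave generation: agent (step mod n) emits the token at each step."""
    if not agents:
        raise ValueError("at least one agent is required")
    context = list(prompt)  # shared auto-regressive context
    for step in range(max_tokens):
        agent = agents[step % len(agents)]  # strict turn-taking
        token = agent(context)              # every agent sees all prior tokens
        if token == stop_token:
            break
        context.append(token)
    return context[len(prompt):]  # only the newly generated tokens


# Toy agents for illustration only; real agents would be LLM decoders.
def agent_a(context: Sequence[str]) -> str:
    return "a"


def agent_b(context: Sequence[str]) -> str:
    return "b"


generated = round_robin_generate([agent_a, agent_b], ["q"], max_tokens=4)
```

The point of the sketch is structural: every token is conditioned on a context that all agents have contributed to, so no agent ever submits a fully formed answer for counting. It says nothing about robustness by itself, which depends on the behavior of the underlying models.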


Key Contributions

  • Reveals that response-level aggregation (majority voting) collapses when >50% of agents are corrupted by prompt injections
  • Proposes Token-Level Round-Robin (RR) collaboration, where agents sequentially interleave generation in a shared context
  • Proves theoretically and demonstrates empirically that token-level interleaving remains robust beyond the 50% corruption threshold at which majority voting fails
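The 50% threshold for response-level aggregation in the list above can be seen in a small idealized sketch. It assumes that honest agents always return the correct answer and that all corrupted agents return the same injected answer; the function names and the example answers are hypothetical, and real agents are noisier than this.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent final answer (response-level aggregation)."""
    return Counter(answers).most_common(1)[0][0]


def vote_with_corruption(
    n_agents: int,
    n_corrupted: int,
    honest_answer: str = "42",    # assumed correct answer, illustrative only
    injected_answer: str = "7",   # assumed attacker-chosen answer
) -> str:
    """Idealized vote: corrupted agents agree with each other, honest ones are correct."""
    if not 0 <= n_corrupted <= n_agents or n_agents == 0:
        raise ValueError("need 0 <= n_corrupted <= n_agents and n_agents > 0")
    answers = [injected_answer] * n_corrupted
    answers += [honest_answer] * (n_agents - n_corrupted)
    return majority_vote(answers)


# With 5 agents, the outcome flips as soon as corrupted agents outnumber honest ones.
minority_corrupted = vote_with_corruption(5, 2)  # 2 of 5 corrupted
majority_corrupted = vote_with_corruption(5, 3)  # 3 of 5 corrupted
```

Under these assumptions the vote only counts final answers, so once the corrupted agents hold more than half of the votes the injected answer wins regardless of how sound the honest agents' reasoning was.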

Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time, targeted
Applications
multi-agent llm systems, autonomous research assistants, travel planning agents, collaborative reasoning systems