Can an Individual Manipulate the Collective Decisions of Multi-Agents?
Fengyuan Liu 1,2, Rui Zhao 1, Shuo Chen 3,4,5, Guohao Li 2, Philip Torr 2, Lei Han 1, Jindong Gu 2
Published on arXiv
2509.16494
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
M-Spoiler misleads the collective decisions of multi-agent LLM systems using adversarial suffixes optimized with white-box access to only a single known agent, and it remains more effective than baseline attacks under multiple defense mechanisms across 9 models and 7 datasets.
M-Spoiler
Novel technique introduced
Individual Large Language Models (LLMs) have demonstrated significant capabilities across various domains, such as healthcare and law. Recent studies also show that coordinated multi-agent systems exhibit enhanced decision-making and reasoning abilities through collaboration. However, due to the vulnerabilities of individual LLMs and the difficulty of accessing all agents in a multi-agent system, a key question arises: If attackers only know one agent, could they still generate adversarial samples capable of misleading the collective decision? To explore this question, we formulate it as a game with incomplete information, where attackers know only one target agent and lack knowledge of the other agents in the system. With this formulation, we propose M-Spoiler, a framework that simulates agent interactions within a multi-agent system to generate adversarial samples. These samples are then used to manipulate the target agent in the target system, misleading the system's collaborative decision-making process. More specifically, M-Spoiler introduces a stubborn agent that actively aids in optimizing adversarial samples by simulating potential stubborn responses from agents in the target system. This enhances the effectiveness of the generated adversarial samples in misleading the system. Through extensive experiments across various tasks, our findings confirm the risks posed by the knowledge of an individual agent in multi-agent systems and demonstrate the effectiveness of our framework. We also explore several defense mechanisms, showing that our proposed attack framework remains more potent than baselines, underscoring the need for further research into defensive strategies.
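The interaction loop described above can be sketched as follows. This is a hypothetical toy, not the paper's implementation: a candidate adversarial suffix is scored by simulating a short debate between the known target agent and a simulated "stubborn" agent that pushes back, so the suffix is selected to survive disagreement. The agent stubs, prompt format, and success criterion below are illustrative assumptions.

```python
# Sketch of an M-Spoiler-style debate simulation with stub agents.
# All names and the scoring rule are assumptions for illustration.
from typing import Callable, List, Tuple

Agent = Callable[[str], str]  # maps a prompt to a response label

def simulate_debate(query: str, suffix: str, target: Agent,
                    stubborn: Agent, rounds: int = 3) -> List[Tuple[str, str]]:
    """Run a short exchange; the stubborn agent holds its answer fixed."""
    transcript = []
    prompt = query + " " + suffix
    for _ in range(rounds):
        t = target(prompt)
        s = stubborn(query)  # the stubborn agent ignores the suffix
        transcript.append((t, s))
        # Feed the disagreeing response back to the target, as a real
        # multi-agent discussion round would.
        prompt = f"{query} {suffix} [other agent said: {s}]"
    return transcript

def attack_success(transcript: List[Tuple[str, str]], wrong_label: str) -> bool:
    """Success if the target still emits the wrong label in the final round."""
    final_target, _ = transcript[-1]
    return final_target == wrong_label

# Toy agents: the target flips to "unsafe" only when a trigger suffix
# appears; the stubborn agent always answers "safe".
target = lambda p: "unsafe" if "!!trigger!!" in p else "safe"
stubborn = lambda q: "safe"

log = simulate_debate("Is this text benign?", "!!trigger!!", target, stubborn)
print(attack_success(log, "unsafe"))  # → True
```

In the full framework this simulated pushback is what distinguishes M-Spoiler from optimizing a suffix against the target agent in isolation: a suffix that only fools a lone agent can be corrected by the rest of the system during discussion.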
Key Contributions
- Frames multi-agent LLM manipulation as an incomplete-information game where the adversary has white-box access to one agent but no knowledge of the others
- Proposes M-Spoiler, introducing a simulated stubborn agent and critical agent to optimize adversarial suffixes that transfer across the full multi-agent system
- Demonstrates empirically across 9 LLMs and 7 datasets that single-agent knowledge is sufficient to compromise collective multi-agent decisions, and that M-Spoiler outperforms baselines under several defense strategies
🛡️ Threat Analysis
M-Spoiler generates gradient-optimized adversarial token suffixes (GCG-style) targeting LLMs at inference time to cause misclassification and misleading outputs — a direct adversarial suffix attack evaluated on AdvBench and classification tasks.
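A minimal, gradient-free stand-in for the greedy coordinate search behind GCG-style suffix optimization: at each step, try swapping one suffix token for every candidate in a small vocabulary and keep the swap that most lowers a surrogate loss. A real GCG attack uses token-embedding gradients to shortlist candidate swaps against an actual LLM; the vocabulary, loss, and target string here are toy assumptions.

```python
# Toy greedy coordinate search over a discrete suffix, in the spirit of
# GCG. The surrogate loss (edit distance to a fixed string) stands in
# for the attack loss computed from the target model's logits.
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz")
TARGET = "spoiler"  # string the surrogate loss prefers (assumption)

def loss(suffix):
    """Surrogate loss: number of positions that differ from TARGET."""
    return sum(a != b for a, b in zip(suffix, TARGET))

def greedy_coordinate_search(length=len(TARGET), steps=100, seed=0):
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(length)]
    for _ in range(steps):
        best = (loss(suffix), None, None)
        for i in range(length):           # coordinate (position) to edit
            for tok in VOCAB:             # candidate replacement token
                cand = suffix[:]
                cand[i] = tok
                l = loss(cand)
                if l < best[0]:
                    best = (l, i, tok)
        if best[1] is None:               # no improving swap: converged
            break
        suffix[best[1]] = best[2]
    return "".join(suffix), loss(suffix)

s, final_loss = greedy_coordinate_search()
print(s, final_loss)  # → spoiler 0
```

Each iteration changes exactly one token, so the loss drops by at most one per step; the search converges once no single-token swap improves the surrogate loss, mirroring the coordinate-wise structure of the real attack.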