
Constrained Black-Box Attacks Against Cooperative Multi-Agent Reinforcement Learning

Amine Andam, Jamal Bentahar, Mustapha Hedabou



Published on arXiv (2025): 2508.09275

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Achieves effective sabotage of deployed c-MARL systems using only 1,000 samples compared to millions required by prior black-box methods, validated across diverse algorithms and 22 environments.

Align attack / Hadamard attack

Novel technique introduced


Cooperative multi-agent reinforcement learning has rapidly evolved, offering state-of-the-art algorithms for real-world applications, including sensitive domains. However, a key challenge to its widespread adoption is the lack of a thorough investigation into its vulnerabilities to adversarial attacks. Existing work predominantly focuses on training-time attacks or unrealistic scenarios, such as access to policy weights or the ability to train surrogate policies. In this paper, we investigate new vulnerabilities under more challenging and constrained conditions, assuming an adversary can only collect and perturb the observations of deployed agents. We also consider scenarios where the adversary has no access at all (no observations, actions, or weights). Our main approach is to generate perturbations that intentionally misalign how victim agents see their environment. Our approach is empirically validated on three benchmarks and 22 environments, demonstrating its effectiveness across diverse algorithms and environments. Furthermore, we show that our algorithm is sample-efficient, requiring only 1,000 samples compared to the millions needed by previous methods.


Key Contributions

  • Align attack: crafts observation perturbations that intentionally misalign agents' views of shared environment state, requiring only 1,000 samples vs. millions for prior surrogate-based black-box methods
  • Hadamard attack: zero-access structured perturbation attack using partial Hadamard (orthogonal) matrices that induces misalignment without any observations, actions, or policy weights
  • Combined targeted attack that uses Align's agent-profiling capability with Hadamard's efficient perturbation generation, validated across 3 benchmarks and 22 environments
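The Hadamard attack's structured-perturbation idea can be illustrated with a short sketch. This is an illustrative reconstruction, not the paper's implementation: it assumes the attack assigns each agent a distinct row of a partial Hadamard matrix (rows of a Sylvester-constructed Hadamard matrix are mutually orthogonal) as a fixed perturbation direction, so agents' views of the shared state are pushed in maximally different directions without the adversary observing anything. The function names, the `eps` scale, and the column truncation to the observation size are assumptions for the sketch.

```python
def sylvester_hadamard(k):
    """Build the 2^k x 2^k Hadamard matrix via Sylvester's construction:
    H_{2n} = [[H, H], [H, -H]]. All rows are mutually orthogonal."""
    H = [[1]]
    for _ in range(k):
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H

def hadamard_perturbations(n_agents, obs_dim, eps):
    """Assign each agent a distinct scaled row of a partial Hadamard matrix
    (an n_agents-row subset). Truncating rows to obs_dim is an illustrative
    simplification; exact orthogonality holds when obs_dim is the full width."""
    k = 0
    while 2 ** k < max(n_agents, obs_dim):
        k += 1
    H = sylvester_hadamard(k)
    return [[eps * v for v in H[i][:obs_dim]] for i in range(n_agents)]

def perturb(observations, eps=0.1):
    """Add each agent's fixed perturbation direction to its observation.
    Requires no access to weights, actions, or other agents' observations."""
    deltas = hadamard_perturbations(len(observations), len(observations[0]), eps)
    return [[o + d for o, d in zip(obs, delta)]
            for obs, delta in zip(observations, deltas)]
```

Because the assigned directions are orthogonal, each agent's perceived state drifts in a different direction, which is one simple way to induce the misalignment the paper targets.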

🛡️ Threat Analysis

Input Manipulation Attack

Proposes adversarial perturbations applied to deployed agents' observations at inference/test time, causing environmental-state misalignment and disrupting coordination. This is a novel input-manipulation attack on deployed RL policies that requires no access to weights, actions, or architecture.


Details

Domains
reinforcement-learning
Model Types
rl
Threat Tags
black_box, inference_time, untargeted, digital
Datasets
MARL benchmarks (3 benchmarks, 22 cooperative tasks including Pursuit game)
Applications
cooperative multi-agent reinforcement learning, multi-agent systems