benchmark 2025

Benchmarking the Robustness of Agentic Systems to Adversarially-Induced Harms

Jonathan Nöther , Adish Singla , Goran Radanovic

0 citations

α

Published on arXiv

2508.16481

Prompt Injection

OWASP LLM Top 10 — LLM01

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

A single adversarial agent within a multi-agent LLM system achieves high success rates at inducing harmful actions across all tested environments, remaining effective even when other agents apply prompting-based defenses.

BAD-ACTS

Novel technique introduced


Ensuring the safe use of agentic systems requires a thorough understanding of the range of malicious behaviors these systems may exhibit when under attack. In this paper, we evaluate the robustness of LLM-based agentic systems against attacks that aim to elicit harmful actions from agents. To this end, we propose a novel taxonomy of harms for agentic systems and a novel benchmark, BAD-ACTS, for studying the security of agentic systems with respect to a wide range of harmful actions. BAD-ACTS consists of 4 implementations of agentic systems in distinct application environments, as well as a dataset of 188 high-quality examples of harmful actions. This enables a comprehensive study of the robustness of agentic systems across a wide range of categories of harmful behaviors, available tools, and inter-agent communication structures. Using this benchmark, we analyze the robustness of agentic systems against an attacker that controls one of the agents in the system and aims to manipulate other agents to execute a harmful target action. Our results show that the attack has a high success rate, demonstrating that even a single adversarial agent within the system can have a significant impact on the security. This attack remains effective even when agents use a simple prompting-based defense strategy. However, we additionally propose a more effective defense based on message monitoring. We believe that this benchmark provides a diverse testbed for the security research of agentic systems. The benchmark can be found at github.com/JNoether/BAD-ACTS


Key Contributions

  • Novel taxonomy of harms specifically designed for LLM-based agentic systems across diverse application environments
  • BAD-ACTS benchmark: 4 agentic system implementations in distinct environments plus 188 high-quality harmful action examples covering a wide range of harm categories and inter-agent communication structures
  • Empirical demonstration that a single adversarial agent achieves high harmful-action success rates even against prompting-based defenses, and a more effective message-monitoring defense strategy

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_timetargetedgrey_box
Datasets
BAD-ACTS
Applications
llm-based agentic systemsmulti-agent systems