benchmark 2025

Benchmarking the Robustness of Agentic Systems to Adversarially-Induced Harms

Jonathan Nöther , Adish Singla , Goran Radanovic

Max Planck Institute for Software Systems

0 citations

Published on arXiv

2508.16481

Prompt Injection

OWASP LLM Top 10 — LLM01

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

A single adversarial agent within a multi-agent LLM system achieves high success rates at inducing harmful actions across all tested environments, remaining effective even when other agents apply prompting-based defenses.

BAD-ACTS

Novel technique introduced

Ensuring the safe use of agentic systems requires a thorough understanding of the range of malicious behaviors these systems may exhibit when under attack. In this paper, we evaluate the robustness of LLM-based agentic systems against attacks that aim to elicit harmful actions from agents. To this end, we propose a novel taxonomy of harms for agentic systems and a novel benchmark, BAD-ACTS, for studying the security of agentic systems with respect to a wide range of harmful actions. BAD-ACTS consists of 4 implementations of agentic systems in distinct application environments, as well as a dataset of 188 high-quality examples of harmful actions. This enables a comprehensive study of the robustness of agentic systems across a wide range of categories of harmful behaviors, available tools, and inter-agent communication structures. Using this benchmark, we analyze the robustness of agentic systems against an attacker that controls one of the agents in the system and aims to manipulate other agents to execute a harmful target action. Our results show that the attack has a high success rate, demonstrating that even a single adversarial agent within the system can have a significant impact on the security. This attack remains effective even when agents use a simple prompting-based defense strategy. However, we additionally propose a more effective defense based on message monitoring. We believe that this benchmark provides a diverse testbed for the security research of agentic systems. The benchmark can be found at github.com/JNoether/BAD-ACTS

Key Contributions

Novel taxonomy of harms specifically designed for LLM-based agentic systems across diverse application environments
BAD-ACTS benchmark: 4 agentic system implementations in distinct environments plus 188 high-quality harmful action examples covering a wide range of harm categories and inter-agent communication structures
Empirical demonstration that a single adversarial agent achieves high harmful-action success rates even against prompting-based defenses, and a more effective message-monitoring defense strategy

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

inference_timetargetedgrey_box

Datasets

BAD-ACTS

Applications

llm-based agentic systemsmulti-agent systems

Read PDF arXiv Code

Benchmarking the Robustness of Agentic Systems to Adversarially-Induced Harms

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Reliable Weak-to-Strong Monitoring of LLM Agents

You Told Me to Do It: Measuring Instructional Text-induced Private Data Leakage in LLM Agents

FinVault: Benchmarking Financial Agent Safety in Execution-Grounded Environments

Async Control: Stress-testing Asynchronous Control Measures for LLM Agents

Securing Agentic AI: Threat Modeling and Risk Analysis for Network Monitoring Agentic AI System

All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

PEAR: Planner-Executor Agent Robustness Benchmark