Mind the Gap: Comparing Model- vs Agentic-Level Red Teaming with Action-Graph Observability on GPT-OSS-20B
Ilham Wicaksono 1, Zekun Wu 1,2, Rahul Patel 2, Theo King 2, Adriano Koshiyama 1,2, Philip Treleaven 1
Published on arXiv: 2509.17259
Prompt Injection
OWASP LLM Top 10 — LLM01
Excessive Agency
OWASP LLM Top 10 — LLM08
Key Finding
Agentic-level iterative attacks compromise objectives that completely failed at the model level, with tool-calling contexts showing 24% higher vulnerability; conversely, agentic-level prompts lose 50-80% of their effectiveness upon reinjection, revealing context-dependent instability.
AgentSeer
Novel technique introduced
As industry increasingly adopts agentic AI systems, understanding their unique vulnerabilities becomes critical. Prior research suggests that security flaws at the model level do not fully capture the risks present in agentic deployments, where models interact with tools and external environments. This paper investigates this gap through a comparative red-teaming analysis of GPT-OSS-20B, a 20-billion-parameter open-source model. Using our observability framework AgentSeer to deconstruct agentic systems into granular actions and components, we apply iterative red-teaming attacks with harmful objectives from HarmBench at two distinct levels: the standalone model and the model operating within an agentic loop. Our evaluation reveals fundamental differences between model-level and agentic-level vulnerability profiles. Critically, we discover agentic-only vulnerabilities: attack vectors that emerge exclusively within agentic execution contexts while remaining inert against standalone models. Agentic-level iterative attacks successfully compromise objectives that completely failed at the model level, with tool-calling contexts showing 24% higher vulnerability than non-tool contexts. Conversely, certain model-specific exploits work exclusively at the model level and fail when transferred to agentic contexts, demonstrating that standalone-model vulnerabilities do not always generalize to deployed systems.
Key Contributions
- Empirically demonstrates that model-level jailbreak prompts do not reliably transfer to agentic deployments, with tool message injection achieving only 40% ASR vs. 57% for human message injection
- Identifies and characterizes 'agentic-only' vulnerabilities — attack vectors that succeed exclusively within agentic execution contexts (tool-calling contexts 24% more vulnerable) but fail against standalone models
- Introduces AgentSeer, an observability framework that deconstructs agentic systems into granular actions to enable component-level red teaming and comparative evaluation
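The contrast between model-level and agentic-level injection can be made concrete with a minimal sketch. This is not AgentSeer's API; the function names, message schema, and the `web_search` tool are hypothetical, chosen only to show where the two injection surfaces differ: at the model level the adversarial objective arrives as a human message, while in the agentic loop it can be smuggled into a tool message mid-trajectory.

```python
# Hypothetical illustration of the two injection contexts compared in the paper.
# All names here (build_*, attack_success_rate, web_search) are illustrative,
# not part of AgentSeer or any real harness.

def build_model_level_attack(objective: str) -> list[dict]:
    """Model level: the harmful objective is sent directly as a human message."""
    return [{"role": "user", "content": objective}]

def build_agentic_attack(objective: str, tool_output: str) -> list[dict]:
    """Agentic level: the payload rides inside a tool message in the loop."""
    return [
        {"role": "user", "content": "Summarize the latest search results."},
        {"role": "assistant",
         "tool_call": {"name": "web_search", "args": {"q": "news"}}},
        # Injection point: the attacker controls what the tool returns,
        # so the objective reaches the model as trusted-looking tool output.
        {"role": "tool", "name": "web_search",
         "content": f"{tool_output}\n{objective}"},
    ]

def attack_success_rate(outcomes: list[bool]) -> float:
    """ASR: fraction of attack trials judged harmful (e.g., by a classifier)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

Running both builders against the same HarmBench objective and scoring each transcript set with `attack_success_rate` yields the per-context ASR figures the paper reports (e.g., tool-message vs. human-message injection).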