Mind the Gap: Comparing Model- vs Agentic-Level Red Teaming with Action-Graph Observability on GPT-OSS-20B
Ilham Wicaksono 1, Zekun Wu 1,2, Rahul Patel 2, Theo King 2, Adriano Koshiyama 1,2, Philip Treleaven 1
Published on arXiv: 2509.17259
Prompt Injection
OWASP LLM Top 10 — LLM01
Excessive Agency
OWASP LLM Top 10 — LLM08
Key Finding
Agentic-level iterative attacks compromise objectives that completely failed at the model level, with tool-calling contexts showing 24% higher vulnerability; conversely, agentic-level prompts lose 50-80% of their effectiveness upon reinjection, revealing context-dependent instability.
AgentSeer
Novel technique introduced
As industry increasingly adopts agentic AI systems, understanding their unique vulnerabilities becomes critical. Prior research suggests that security flaws at the model level do not fully capture the risks present in agentic deployments, where models interact with tools and external environments. This paper investigates this gap through a comparative red-teaming analysis of GPT-OSS-20B, a 20-billion-parameter open-source model. Using our observability framework AgentSeer to deconstruct agentic systems into granular actions and components, we apply iterative red-teaming attacks with harmful objectives from HarmBench at two distinct levels: the standalone model and the model operating within an agentic loop. Our evaluation reveals fundamental differences between model-level and agentic-level vulnerability profiles. Critically, we discover agentic-only vulnerabilities: attack vectors that emerge exclusively within agentic execution contexts while remaining inert against standalone models. Agentic-level iterative attacks successfully compromise objectives that completely failed at the model level, with tool-calling contexts showing 24% higher vulnerability than non-tool contexts. Conversely, certain model-specific exploits work exclusively at the model level and fail when transferred to agentic contexts, demonstrating that standalone-model vulnerabilities do not always generalize to deployed systems.
Key Contributions
- Empirically demonstrates that model-level jailbreak prompts do not reliably transfer to agentic deployments, with tool message injection achieving only 40% ASR vs. 57% for human message injection
- Identifies and characterizes 'agentic-only' vulnerabilities — attack vectors that succeed exclusively within agentic execution contexts (tool-calling contexts 24% more vulnerable) but fail against standalone models
- Introduces AgentSeer, an observability framework that deconstructs agentic systems into granular actions to enable component-level red teaming and comparative evaluation
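The contrast between model-level and agentic-level injection can be made concrete with a minimal sketch. This is not AgentSeer's API; the function names, message schema, and the `web_search` tool are hypothetical, chosen only to show where the two injection surfaces differ: at the model level the adversarial objective arrives as a human message, while in the agentic loop it can be smuggled into a tool message mid-trajectory.

```python
# Hypothetical illustration of the two injection contexts compared in the paper.
# All names here (build_*, attack_success_rate, web_search) are illustrative,
# not part of AgentSeer or any real harness.

def build_model_level_attack(objective: str) -> list[dict]:
    """Model level: the harmful objective is sent directly as a human message."""
    return [{"role": "user", "content": objective}]

def build_agentic_attack(objective: str, tool_output: str) -> list[dict]:
    """Agentic level: the payload rides inside a tool message in the loop."""
    return [
        {"role": "user", "content": "Summarize the latest search results."},
        {"role": "assistant",
         "tool_call": {"name": "web_search", "args": {"q": "news"}}},
        # Injection point: the attacker controls what the tool returns,
        # so the objective reaches the model as trusted-looking tool output.
        {"role": "tool", "name": "web_search",
         "content": f"{tool_output}\n{objective}"},
    ]

def attack_success_rate(outcomes: list[bool]) -> float:
    """ASR: fraction of attack trials judged harmful (e.g., by a classifier)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

Running both builders against the same HarmBench objective and scoring each transcript set with `attack_success_rate` yields the per-context ASR figures the paper reports (e.g., tool-message vs. human-message injection).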