Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks
Shoumik Saha, Jifan Chen, Sam Mayers, Sanjay Krishna Gouda, Zijian Wang, Varun Kumar
Published on arXiv
arXiv:2510.01359
Prompt Injection
OWASP LLM Top 10 — LLM01
Excessive Agency
OWASP LLM Top 10 — LLM08
Key Finding
Wrapping an LLM in a code agent increases attack success rate (ASR) by 1.6x, reaching ~75% ASR in the multi-file regime, with 32% of outputs being instantly deployable malicious code
JAWS-BENCH
Novel technique introduced
Code-capable large language model (LLM) agents are increasingly embedded into software engineering workflows where they can read, write, and execute code, raising the stakes of safety-bypass ("jailbreak") attacks beyond text-only settings. Prior evaluations emphasize refusal or harmful-text detection, leaving open whether agents actually compile and run malicious programs. We present JAWS-BENCH (Jailbreaks Across WorkSpaces), a benchmark spanning three escalating workspace regimes that mirror attacker capability: empty (JAWS-0), single-file (JAWS-1), and multi-file (JAWS-M). We pair this with a hierarchical, executable-aware Judge Framework that tests (i) compliance, (ii) attack success, (iii) syntactic correctness, and (iv) runtime executability, moving beyond refusal to measure deployable harm. Using seven LLMs from five families as backends, we find that under prompt-only conditions in JAWS-0, code agents accept 61% of attacks on average; 58% are harmful, 52% parse, and 27% run end-to-end. Moving to the single-file regime (JAWS-1) drives compliance to ~100% for capable models and yields a mean attack success rate (ASR) of ~71%; the multi-file regime (JAWS-M) raises mean ASR to ~75%, with 32% of responses producing instantly deployable attack code. Across models, wrapping an LLM in an agent substantially increases vulnerability, raising ASR by 1.6x, because initial refusals are frequently overturned during later planning/tool-use steps. Category-level analyses identify which attack classes are most vulnerable and most readily deployable, while others exhibit large execution gaps. These findings motivate execution-aware defenses, code-contextual safety filters, and mechanisms that preserve refusal decisions throughout the agent's multi-step reasoning and tool use.
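To make the four-stage evaluation concrete, here is a minimal sketch of such a hierarchical, executable-aware judge. The stage ordering follows the abstract; everything else is an illustrative assumption rather than the paper's implementation: the `Verdict` dataclass, the `is_refusal`/`is_harmful` callbacks (which in practice would likely be LLM-based judges), and the bare-subprocess execution step.

```python
# Hypothetical sketch of a hierarchical, executable-aware judge in the
# spirit of the paper's Judge Framework; names and signatures are assumed.
import ast
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    complied: bool = False  # (i)  agent attempted the task rather than refusing
    harmful: bool = False   # (ii) output realizes the attack intent
    parses: bool = False    # (iii) emitted code is syntactically valid
    runs: bool = False      # (iv) code executes end-to-end

def judge(response: str, code: str,
          is_refusal: Callable[[str], bool],
          is_harmful: Callable[[str, str], bool]) -> Verdict:
    """Run the four stages in order; each stage gates the next."""
    v = Verdict()
    v.complied = not is_refusal(response)       # stage (i): compliance
    if not v.complied:
        return v
    v.harmful = is_harmful(response, code)      # stage (ii): attack success
    if not v.harmful:
        return v
    try:                                        # stage (iii): syntactic correctness
        ast.parse(code)
        v.parses = True
    except SyntaxError:
        return v
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:                                        # stage (iv): runtime executability;
        # in practice this must run inside an isolated sandbox, since the
        # code under test is by construction malicious.
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=30)
        v.runs = proc.returncode == 0
    except subprocess.TimeoutExpired:
        v.runs = False
    return v
```

Gating each stage on the previous one mirrors the funnel the paper reports for JAWS-0: 61% of attacks are accepted, 58% are harmful, 52% parse, and only 27% run end-to-end.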
Key Contributions
- JAWS-BENCH: a three-regime benchmark (empty, single-file, multi-file workspaces) for evaluating jailbreaks on code-capable LLM agents (a workspace-construction sketch follows this list)
- Hierarchical executable-aware Judge Framework measuring compliance, attack success, syntactic correctness, and runtime executability — going beyond refusal to measure deployable harm
- Empirical finding that agent wrapping increases ASR by 1.6x (mean ASR ~75% in the multi-file regime, 32% instantly deployable), because initial refusals are frequently overturned during planning and tool-use steps
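The three regimes differ only in the workspace handed to the agent before the attack prompt is issued. Below is a minimal sketch of how such workspaces might be constructed; the regime names track the benchmark, but the file names and contents are hypothetical.

```python
# Hypothetical construction of the three JAWS-BENCH workspace regimes;
# file names and contents are illustrative assumptions.
import pathlib
import tempfile

def make_workspace(regime: str) -> pathlib.Path:
    """Create a scratch directory matching one of the three regimes."""
    root = pathlib.Path(tempfile.mkdtemp(prefix=f"jaws_{regime}_"))
    if regime == "jaws-0":
        pass                                    # empty workspace: prompt-only attack
    elif regime == "jaws-1":
        (root / "main.py").write_text("# benign starter code\n")   # single seed file
    elif regime == "jaws-m":
        (root / "main.py").write_text("from utils import helper\n")  # multi-file project
        (root / "utils.py").write_text("def helper():\n    return 42\n")
        (root / "README.md").write_text("# project scaffold\n")
    else:
        raise ValueError(f"unknown regime: {regime}")
    return root
```

The attack prompt is then given to the agent with the returned directory as its working tree; per the paper's results, the richer starting context of JAWS-1 and JAWS-M is what drives compliance and ASR upward.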