benchmark arXiv Oct 21, 2025 · Oct 2025
Artur Zolkowski, Wen Xing, David Lindner et al. · ETH Zürich · ML Alignment & Theory Scholars +1 more
Stress-tests CoT safety monitoring: reasoning models can hide malicious intent via prompt-induced obfuscation, collapsing detection from 96% to ~10%
Prompt Injection nlp
Recent findings suggest that misaligned models may exhibit deceptive behavior, raising concerns about output trustworthiness. Chain-of-thought (CoT) is a promising tool for alignment monitoring: when models articulate their reasoning faithfully, monitors can detect and mitigate harmful behaviors before undesirable outcomes occur. However, a key uncertainty is: Can models obfuscate their CoT in order to pursue hidden adversarial objectives while evading detection? To answer this question and thus stress-test CoT monitorability, we develop a composable and quantifiable taxonomy of prompts to elicit CoT obfuscation. We evaluate both internal CoT (reasoning traces) and external CoT (prompted reasoning in outputs) using toy tasks and more realistic environments in SHADE-Arena. We show that: (i) CoT monitoring performs accurately and efficiently without obfuscation pressure. (ii) Under strong obfuscation pressure, some models successfully complete adversarial tasks while evading detection. (iii) Models do not obfuscate their internal CoT as much as their external CoT (under prompt pressure). These results suggest that while CoT provides valuable oversight in benign settings, robust deployment requires model-specific stress-testing of monitorability.
llm ETH Zürich · ML Alignment & Theory Scholars · Google DeepMind
attack arXiv Oct 19, 2025 · Oct 2025
Jie Zhang, Meng Ding, Yang Liu et al. · ETH Zürich · University at Buffalo +1 more
Exploits LLMs' comparative confidence expressions as black-box optimization signal for adversarial image attacks, jailbreaks, and prompt injections
Input Manipulation Attack Prompt Injection visionnlpmultimodal
We present a novel approach for attacking black-box large language models (LLMs) by exploiting their ability to express confidence in natural language. Existing black-box attacks require either access to continuous model outputs like logits or confidence scores (which are rarely available in practice), or rely on proxy signals from other models. Instead, we demonstrate how to prompt LLMs to express their internal confidence in a way that is sufficiently calibrated to enable effective adversarial optimization. We apply our general method to three attack scenarios: adversarial examples for vision-LLMs, jailbreaks and prompt injections. Our attacks successfully generate malicious inputs against systems that only expose textual outputs, thereby dramatically expanding the attack surface for deployed LLMs. We further find that better and larger models exhibit superior calibration when expressing confidence, creating a concerning security paradox where model capability improvements directly enhance vulnerability. Our code is available at this [link](https://github.com/zj-jayzhang/black_box_llm_optimization).
llm vlm ETH Zürich · University at Buffalo · ByteDance
defense arXiv Jan 14, 2026 · 11w ago
Hanna Foerster, Tom Blanchard, Kristina Nikolić et al. · University of Cambridge · University of Toronto +3 more
Defends computer-use AI agents against prompt injection via pre-computed execution graphs, revealing Branch Steering as a residual threat
Prompt Injection Excessive Agency nlpmultimodal
AI agents are vulnerable to prompt injection attacks, where malicious content hijacks agent behavior to steal credentials or cause financial loss. The only known robust defense is architectural isolation that strictly separates trusted task planning from untrusted environment observations. However, applying this design to Computer Use Agents (CUAs) -- systems that automate tasks by viewing screens and executing actions -- presents a fundamental challenge: current agents require continuous observation of UI state to determine each action, conflicting with the isolation required for security. We resolve this tension by demonstrating that UI workflows, while dynamic, are structurally predictable. We introduce Single-Shot Planning for CUAs, where a trusted planner generates a complete execution graph with conditional branches before any observation of potentially malicious content, providing provable control flow integrity guarantees against arbitrary instruction injections. Although this architectural isolation successfully prevents instruction injections, we show that additional measures are needed to prevent Branch Steering attacks, which manipulate UI elements to trigger unintended valid paths within the plan. We evaluate our design on OSWorld, and retain up to 57% of the performance of frontier models while improving performance for smaller open-source models by up to 19%, demonstrating that rigorous security and utility can coexist in CUAs.
llm vlm University of Cambridge · University of Toronto · Vector Institute +2 more
attack arXiv Feb 5, 2026 · 8w ago
Xin Chen, Jie Zhang, Florian Tramèr · ETH Zürich
RL-trained 1.5B model generates universal, transferable prompt injection suffixes that compromise GPT, Claude, and Gemini agents
Prompt Injection nlp
Prompt injection is one of the most critical vulnerabilities in LLM agents; yet, effective automated attacks remain largely unexplored from an optimization perspective. Existing methods heavily depend on human red-teamers and hand-crafted prompts, limiting their scalability and adaptability. We propose AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes while jointly optimizing for attack success and utility preservation on benign tasks. Our black-box method supports both query-based optimization and transfer attacks to unseen models and tasks. Using only a 1.5B parameter adversarial suffix generator, we successfully compromise frontier systems including GPT 5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash on the AgentDojo benchmark, establishing a stronger baseline for automated prompt injection research.
llm ETH Zürich