Robert Mullins

h-index: 14 1,072 citations 43 papers (total)

Papers in Database (4)

benchmark arXiv Oct 21, 2025 · Oct 2025

Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability

Artur Zolkowski, Wen Xing, David Lindner et al. · ETH Zürich · ML Alignment & Theory Scholars +1 more

Stress-tests CoT safety monitoring: reasoning models can hide malicious intent via prompt-induced obfuscation, collapsing detection from 96% to ~10%

Prompt Injection nlp
6 citations PDF Code
attack arXiv Oct 19, 2025 · Oct 2025

Black-box Optimization of LLM Outputs by Asking for Directions

Jie Zhang, Meng Ding, Yang Liu et al. · ETH Zürich · University at Buffalo +1 more

Exploits LLMs' comparative confidence expressions as black-box optimization signal for adversarial image attacks, jailbreaks, and prompt injections

Input Manipulation Attack Prompt Injection visionnlpmultimodal
2 citations PDF Code
defense arXiv Jan 14, 2026 · 11w ago

CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

Hanna Foerster, Tom Blanchard, Kristina Nikolić et al. · University of Cambridge · University of Toronto +3 more

Defends computer-use AI agents against prompt injection via pre-computed execution graphs, revealing Branch Steering as a residual threat

Prompt Injection Excessive Agency nlpmultimodal
1 citations PDF
attack arXiv Feb 5, 2026 · 8w ago

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

Xin Chen, Jie Zhang, Florian Tramèr · ETH Zürich

RL-trained 1.5B model generates universal, transferable prompt injection suffixes that compromise GPT, Claude, and Gemini agents

Prompt Injection nlp
PDF