David Lindner

h-index: 3 · 33 citations · 3 papers (total)

Papers in Database (1)

benchmark · arXiv · Oct 21, 2025

Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability

Artur Zolkowski, Wen Xing, David Lindner et al. · ETH Zürich · ML Alignment & Theory Scholars +1 more

Stress-tests chain-of-thought (CoT) safety monitoring: reasoning models can hide malicious intent via prompt-induced obfuscation, collapsing the monitor's detection rate from 96% to ~10%

Prompt Injection · nlp
6 citations · PDF · Code