David Lindner

benchmark arXiv Oct 21, 2025

Artur Zolkowski, Wen Xing, David Lindner et al. · ETH Zürich · ML Alignment & Theory Scholars +1 more

Stress-tests CoT safety monitoring: reasoning models can hide malicious intent via prompt-induced obfuscation, collapsing detection from 96% to ~10%

Prompt Injection nlp

6 citations

Papers in Database (1)