Usman Anwar

defense · arXiv · Feb 26, 2026

Usman Anwar, Julianna Piskorz, David D. Baek et al. · University of Cambridge · Massachusetts Institute of Technology +4 more

Formalizes and detects steganographic reasoning in LLMs, through which misaligned models can evade AI oversight by hiding covert signals in their outputs

Output Integrity Attack · Excessive Agency · nlp

Papers in Database (1)