Jacob Drori

h-index: 1 · 6 citations · 2 papers (total)

Papers in Database (1)

defense · arXiv · Oct 11, 2025

Output Supervision Can Obfuscate the Chain of Thought

Jacob Drori, Luke Marks, Bryce Woodworth et al. · MATS

Shows that output-only RL supervision can still obfuscate an LLM's chain of thought, and proposes two mitigations to preserve CoT monitorability.

Prompt Injection · nlp · reinforcement-learning
1 citation · PDF · Code