Luke Marks

h-index: 3 · 30 citations · 7 papers (total)

Papers in Database (1)

defense arXiv Oct 11, 2025 · Oct 2025

Output Supervision Can Obfuscate the Chain of Thought

Jacob Drori, Luke Marks, Bryce Woodworth et al. · MATS

Shows that output-only RL supervision can still obfuscate an LLM's chain of thought (CoT), and proposes two mitigations to preserve CoT monitorability.

Prompt Injection nlp reinforcement-learning
1 citation PDF Code