Stephen H. Bach

h-index: 9 · 446 citations · 13 papers (total)

Papers in Database (1)

attack · arXiv · Oct 23, 2025

Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training

Zheng-Xin Yong, Stephen H. Bach · Brown University

Shows that reasoning LLMs self-jailbreak via chain-of-thought after benign math/code fine-tuning, even while recognizing that the requests are harmful

Transfer Learning · Attack · Prompt Injection · nlp