Gouki Minegishi

Papers in Database (1)

defense arXiv Mar 6, 2026

Safe Transformer: An Explicit Safety Bit For Interpretable And Controllable Alignment

Jingyuan Feng, Andrew Gambardella, Gouki Minegishi et al. · The University of Tokyo

Defends LLMs against jailbreaks with an explicit safety bit that makes alignment interpretable and overridable, achieving a near-zero attack success rate (ASR)

Prompt Injection nlp