Yavuz Bakman

h-index: 6 · 130 citations · 16 papers (total)

Papers in Database (1)

attack · arXiv · Jan 29, 2026 · 9w ago

Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment

Yavuz Bakman, Duygu Nur Yaldiz, Salman Avestimehr et al. · University of Southern California

Proves that static black-box alignment evaluation guarantees nothing about post-update alignment; constructs LLMs that hide latent jailbreak misalignment triggered by a single benign gradient step

Model Poisoning · Prompt Injection · nlp
1 citation · PDF