Latest papers

1 paper
attack · arXiv · Apr 16, 2026

Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation

Jacob Dang, Brian Y. Xie, Omar G. Younis · Santa Monica College · University of California +2 more

Unsafe agent behaviors transfer subliminally through distillation despite keyword filtering: student models trained only on safe tasks reach 100% deletion rates.

Transfer Learning Attack · Data Poisoning Attack · Excessive Agency · nlp