ML Security Papers

attack · arXiv · Apr 16, 2026

Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation

Jacob Dang, Brian Y. Xie, Omar G. Younis · Santa Monica College · University of California +2 more

Unsafe agent behaviors transfer subliminally through distillation despite keyword filtering, achieving 100% deletion rates in students trained only on safe tasks

Transfer Learning Attack · Data Poisoning Attack · Excessive Agency · nlp
