Latest papers

2 papers
defense arXiv Feb 9, 2026 · 8w ago

When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents

Yuting Ning, Jaylen Jones, Zhehao Zhang et al. · The Ohio State University · Amazon AGI

Guardrail system detects and corrects misaligned actions in computer-use agents, reducing indirect prompt injection attack success by 90%+

Prompt Injection Excessive Agency nlpmultimodal
PDF Code
defense arXiv Nov 18, 2025 · Nov 2025

From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in LLMs

Erum Mushtaq, Anil Ramakrishna, Satyapriya Krishna et al. · University of Southern California · Amazon AGI

Reveals that narrow refusal unlearning on LLMs triggers emergent misalignment in unrelated safety domains, and proposes a retain-data defense to contain it

Transfer Learning Attack Prompt Injection nlp
3 citations PDF