ML Security Papers

Latest papers

2 papers

defense · arXiv · Feb 9, 2026

When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents

Yuting Ning, Jaylen Jones, Zhehao Zhang et al. · The Ohio State University · Amazon AGI

Guardrail system detects and corrects misaligned actions in computer-use agents, reducing indirect prompt injection attack success by 90%+

Prompt Injection · Excessive Agency · nlp · multimodal

defense · arXiv · Nov 18, 2025

From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in LLMs

Erum Mushtaq, Anil Ramakrishna, Satyapriya Krishna et al. · University of Southern California · Amazon AGI

Reveals that narrow refusal unlearning on LLMs triggers emergent misalignment in unrelated safety domains, and proposes a retain-data defense to contain it

Transfer Learning Attack · Prompt Injection · nlp

3 citations
