ML Security Papers

Latest papers

14 papers

attack arXiv Apr 6, 2026 · 6w ago

Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

Zijun Wang, Haoqin Tu, Letian Zhang et al. · UC Santa Cruz · National University of Singapore +4 more

Real-world evaluation showing poisoning of agent persistent state (skills, config, memory) increases attack success from 25% to 64-74% across four LLM backbones

Prompt Injection Excessive Agency nlp

PDF Code

defense arXiv Mar 4, 2026 · 11w ago

Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks

Haoyu Liu, Dingcheng Li, Lukas Rutishauser et al. · UC Berkeley · Google +1 more

Defends multimodal web agents against cross-modal DOM injection attacks using adversarial self-play RL across visual and text channels

Prompt Injection Excessive Agency multimodalreinforcement-learning

PDF

benchmark arXiv Feb 23, 2026 · 12w ago

Can a Teenager Fool an AI? Evaluating Low-Cost Cosmetic Attacks on Age Estimation Systems

Xingyu Shen, Tommy Duong, Xiaodong An et al. · UC Berkeley · Duke University +4 more

Evaluates cosmetic physical attacks (beard, makeup, wrinkles) that fool age-estimation AI into misclassifying minors as adults, achieving up to 83% success rate

Input Manipulation Attack vision

PDF

attack arXiv Feb 22, 2026 · 12w ago

Learning to Detect Language Model Training Data via Active Reconstruction

Junjie Oscar Yin, John X. Morris, Vitaly Shmatikov et al. · University of Washington · Cornell University +2 more

Uses reinforcement learning to fine-tune LLMs and detect training data membership via active reconstruction, outperforming passive MIAs by 10.7%

Membership Inference Attack Sensitive Information Disclosure nlp

PDF

benchmark arXiv Feb 12, 2026 · Feb 2026

MalTool: Malicious Tool Attacks on LLM Agents

Yuepeng Hu, Yuqi Jia, Mengyuan Li et al. · Duke University · UC Berkeley

Benchmarks malicious tool code attacks on LLM agents; coding LLMs generate evasive malware that defeats VirusTotal and agent-specific detectors

AI Supply Chain Attacks Insecure Plugin Design Exploit Generation nlp

PDF

defense arXiv Feb 3, 2026 · Feb 2026

WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents

Xilong Wang, Yinuo Liu, Zhun Wang et al. · Duke University · UC Berkeley

Defends LLM web agents against indirect prompt injection by detecting and localizing malicious webpage segments

Prompt Injection nlp

PDF Code

defense arXiv Dec 15, 2025 · Dec 2025

The Double Life of Code World Models: Provably Unmasking Malicious Behavior Through Execution Traces

Subramanyam Sahoo · UC Berkeley

Defends against backdoored code-generating LLMs by checking execution trace consistency across semantically equivalent program variants

Model Poisoning nlp

PDF

attack arXiv Dec 12, 2025 · Dec 2025

Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors

Max McGuinness, Alex Serrano, Luke Bailey et al. · MATS · UC Berkeley +1 more

Fine-tuning embeds trigger-activated backdoor enabling LLMs to zero-shot evade unseen activation safety monitors

Model Poisoning Prompt Injection nlp

2 citations PDF Code

attack arXiv Dec 10, 2025 · Dec 2025

Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs

Jan Betley, Jorio Cocola, Dylan Feng et al. · Truthful AI · MATS Fellowship +3 more

Demonstrates inductive backdoors and persona-poisoning attacks that corrupt LLMs through narrow fine-tuning generalization

Model Poisoning Data Poisoning Attack Training Data Poisoning nlp

10 citations PDF

attack arXiv Dec 3, 2025 · Dec 2025

In-Context Representation Hijacking

Itay Yona, Amir Sarid, Michael Karasik et al. · MentaLeap · Independent Researcher +1 more

Jailbreaks LLMs by replacing harmful keywords with benign substitutes in-context, hijacking internal representations to bypass safety alignment

Prompt Injection nlp

PDF Code

defense CCS Nov 14, 2025 · Nov 2025

Armadillo: Robust Single-Server Secure Aggregation for Federated Learning with Input Validation

Yiping Ma, Yue Guo, Harish Karthikeyan et al. · University of Pennsylvania · UC Berkeley +1 more

Byzantine-robust federated learning aggregation protocol using ZKPs and input validation, completing in just 3 rounds

Data Poisoning Attack Model Inversion Attack federated-learning

1 citations PDF

defense arXiv Oct 22, 2025 · Oct 2025

Defending Against Prompt Injection with DataFilter

Yizhu Wang, Sizhe Chen, Raghad Alkhudair et al. · UC Berkeley · KACST

Defends LLM agents against indirect prompt injection by filtering malicious instructions from external data before LLM processing

Prompt Injection nlp

9 citations PDF Code

tool arXiv Oct 2, 2025 · Oct 2025

RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents

Chengquan Guo, Chulin Xie, Yu Yang et al. · University of Chicago · University of Illinois Urbana-Champaign +5 more

Automated red-teaming agent that adaptively combines jailbreak tools to uncover safety vulnerabilities in LLM-based code agents

Prompt Injection nlp

4 citations PDF

attack arXiv Sep 30, 2025 · Sep 2025

STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents

Jing-Jing Li, Jianfeng He, Chao Shang et al. · AWS AI Labs · UC Berkeley

Multi-turn attack chains innocuous tool calls on LLM agents to achieve harmful goals, exceeding 90% ASR on GPT-4.1

Insecure Plugin Design Prompt Injection nlp

4 citations PDF Code

Latest papers

Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks

Can a Teenager Fool an AI? Evaluating Low-Cost Cosmetic Attacks on Age Estimation Systems

Learning to Detect Language Model Training Data via Active Reconstruction

MalTool: Malicious Tool Attacks on LLM Agents

WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents

The Double Life of Code World Models: Provably Unmasking Malicious Behavior Through Execution Traces

Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors

Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs

In-Context Representation Hijacking

Armadillo: Robust Single-Server Secure Aggregation for Federated Learning with Input Validation

Defending Against Prompt Injection with DataFilter

RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents

STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue