Latest papers

13 papers
defense arXiv Mar 4, 2026 · 4w ago

Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks

Haoyu Liu, Dingcheng Li, Lukas Rutishauser et al. · UC Berkeley · Google +1 more

Defends multimodal web agents against cross-modal DOM injection attacks using adversarial self-play RL across visual and text channels

Prompt Injection Excessive Agency multimodalreinforcement-learning
PDF
benchmark arXiv Feb 23, 2026 · 6w ago

Can a Teenager Fool an AI? Evaluating Low-Cost Cosmetic Attacks on Age Estimation Systems

Xingyu Shen, Tommy Duong, Xiaodong An et al. · UC Berkeley · Duke University +4 more

Evaluates cosmetic physical attacks (beard, makeup, wrinkles) that fool age-estimation AI into misclassifying minors as adults, achieving up to 83% success rate

Input Manipulation Attack vision
PDF
attack arXiv Feb 22, 2026 · 6w ago

Learning to Detect Language Model Training Data via Active Reconstruction

Junjie Oscar Yin, John X. Morris, Vitaly Shmatikov et al. · University of Washington · Cornell University +2 more

Uses reinforcement learning to fine-tune LLMs and detect training data membership via active reconstruction, outperforming passive MIAs by 10.7%

Membership Inference Attack Sensitive Information Disclosure nlp
PDF
benchmark arXiv Feb 12, 2026 · 7w ago

MalTool: Malicious Tool Attacks on LLM Agents

Yuepeng Hu, Yuqi Jia, Mengyuan Li et al. · Duke University · UC Berkeley

Benchmarks malicious tool code attacks on LLM agents; coding LLMs generate evasive malware that defeats VirusTotal and agent-specific detectors

AI Supply Chain Attacks Insecure Plugin Design nlp
PDF
defense arXiv Feb 3, 2026 · 8w ago

WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents

Xilong Wang, Yinuo Liu, Zhun Wang et al. · Duke University · UC Berkeley

Defends LLM web agents against indirect prompt injection by detecting and localizing malicious webpage segments

Prompt Injection nlp
PDF Code
defense arXiv Dec 15, 2025 · Dec 2025

The Double Life of Code World Models: Provably Unmasking Malicious Behavior Through Execution Traces

Subramanyam Sahoo · UC Berkeley

Defends against backdoored code-generating LLMs by checking execution trace consistency across semantically equivalent program variants

Model Poisoning nlp
PDF
attack arXiv Dec 12, 2025 · Dec 2025

Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors

Max McGuinness, Alex Serrano, Luke Bailey et al. · MATS · UC Berkeley +1 more

Fine-tuning embeds trigger-activated backdoor enabling LLMs to zero-shot evade unseen activation safety monitors

Model Poisoning Prompt Injection nlp
2 citations PDF Code
attack arXiv Dec 10, 2025 · Dec 2025

Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs

Jan Betley, Jorio Cocola, Dylan Feng et al. · Truthful AI · MATS Fellowship +3 more

Demonstrates inductive backdoors and persona-poisoning attacks that corrupt LLMs through narrow fine-tuning generalization

Model Poisoning Data Poisoning Attack Training Data Poisoning nlp
10 citations PDF
attack arXiv Dec 3, 2025 · Dec 2025

In-Context Representation Hijacking

Itay Yona, Amir Sarid, Michael Karasik et al. · MentaLeap · Independent Researcher +1 more

Jailbreaks LLMs by replacing harmful keywords with benign substitutes in-context, hijacking internal representations to bypass safety alignment

Prompt Injection nlp
PDF Code
defense CCS Nov 14, 2025 · Nov 2025

Armadillo: Robust Single-Server Secure Aggregation for Federated Learning with Input Validation

Yiping Ma, Yue Guo, Harish Karthikeyan et al. · University of Pennsylvania · UC Berkeley +1 more

Byzantine-robust federated learning aggregation protocol using ZKPs and input validation, completing in just 3 rounds

Data Poisoning Attack Model Inversion Attack federated-learning
1 citations PDF
defense arXiv Oct 22, 2025 · Oct 2025

Defending Against Prompt Injection with DataFilter

Yizhu Wang, Sizhe Chen, Raghad Alkhudair et al. · UC Berkeley · KACST

Defends LLM agents against indirect prompt injection by filtering malicious instructions from external data before LLM processing

Prompt Injection nlp
9 citations PDF Code
tool arXiv Oct 2, 2025 · Oct 2025

RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents

Chengquan Guo, Chulin Xie, Yu Yang et al. · University of Chicago · University of Illinois Urbana-Champaign +5 more

Automated red-teaming agent that adaptively combines jailbreak tools to uncover safety vulnerabilities in LLM-based code agents

Prompt Injection nlp
4 citations PDF
attack arXiv Sep 30, 2025 · Sep 2025

STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents

Jing-Jing Li, Jianfeng He, Chao Shang et al. · AWS AI Labs · UC Berkeley

Multi-turn attack chains innocuous tool calls on LLM agents to achieve harmful goals, exceeding 90% ASR on GPT-4.1

Insecure Plugin Design Prompt Injection nlp
4 citations PDF Code