Latest papers

16 papers
survey Transactions on Machine Learni... Mar 30, 2026 · 7d ago

Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey

Bhavuk Jain, Sercan Ö. Arık, Hardeo K. Thakur · Google · Bennett University

Surveys adversarial attacks on multimodal LLMs, organizing threats by attacker objectives and linking attacks to architectural vulnerabilities

Input Manipulation Attack Prompt Injection multimodalnlpvisionaudio
PDF
benchmark arXiv Mar 19, 2026 · 18d ago

MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Masoumeh Shafieinejad, Xi He, Mahshid Alinoori et al. · Vector Institute · University of Waterloo +3 more

Competition evaluating membership inference attack resistance of diffusion models generating synthetic tabular data across white-box and black-box settings

Membership Inference Attack tabulargenerative
PDF Code
defense arXiv Mar 4, 2026 · 4w ago

Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks

Haoyu Liu, Dingcheng Li, Lukas Rutishauser et al. · UC Berkeley · Google +1 more

Defends multimodal web agents against cross-modal DOM injection attacks using adversarial self-play RL across visual and text channels

Prompt Injection Excessive Agency multimodalreinforcement-learning
PDF
defense arXiv Feb 9, 2026 · 8w ago

Reinforcement Learning with Backtracking Feedback

Bilgehan Sel, Vaishakh Keshava, Phillip Wallis et al. · Google · Virginia Tech +1 more

Trains LLMs to self-correct safety violations mid-generation via RL and a 'backtrack by x tokens' signal, reducing GCG and jailbreak attack success rates

Input Manipulation Attack Prompt Injection nlp
PDF
defense arXiv Feb 8, 2026 · 8w ago

CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution

Minbeom Kim, Mihir Parmar, Phillip Wallis et al. · Google Cloud AI Research · Seoul National University +2 more

Defends LLM tool-calling agents against indirect prompt injection via causal attribution-based dominance shift detection at privileged action points

Prompt Injection Excessive Agency nlp
PDF
survey IACR ePrint Dec 1, 2025 · Dec 2025

Systems Security Foundations for Agentic Computing

Mihai Christodorescu, Earlence Fernandes, Ashish Hooda et al. · Google · University of California +5 more

Surveys agentic AI security through a systems-security lens, covering prompt injection, tool-use risks, and 11 real-world attack case studies

Prompt Injection Insecure Plugin Design Excessive Agency nlp
3 citations PDF
defense arXiv Nov 24, 2025 · Nov 2025

Re-Key-Free, Risky-Free: Adaptable Model Usage Control

Zihan Wang, Zhongkui Ma, Xinguo Feng et al. · The University of Queensland · CSIRO’s Data61 +3 more

Defends model IP with key-locked weights that survive fine-tuning, keeping unauthorized inference at near-random performance

Model Theft vision
1 citations PDF
defense arXiv Oct 31, 2025 · Oct 2025

Consistency Training Helps Stop Sycophancy and Jailbreaks

Alex Irpan, Alexander Matt Turner, Mark Kurzeja et al. · Google

Defends LLMs against jailbreaks and sycophancy via consistency training, making models invariant to adversarial prompt manipulations

Prompt Injection nlp
PDF
attack arXiv Oct 15, 2025 · Oct 2025

When "Correct" Is Not Safe: Can We Trust Functionally Correct Patches Generated by Code Agents?

Yibo Peng, James Song, Lei Li et al. · Carnegie Mellon University · University of Michigan +3 more

Attacks LLM code agents via crafted issues to produce test-passing but security-vulnerable patches across 12 agent-model combinations

Prompt Injection nlp
PDF
defense SSRN Oct 8, 2025 · Oct 2025

A2AS: Agentic AI Runtime Security and Self-Defense

Eugene Neelou, Ivan Novikov, Max Moroz et al. · A2AS · OWASP +10 more

Proposes A2AS runtime security framework for LLM agents enforcing prompt authentication, behavior boundaries, and in-context defenses

Prompt Injection Excessive Agency nlp
3 citations PDF
defense arXiv Oct 6, 2025 · Oct 2025

Adversarial Reinforcement Learning for Large Language Model Agent Safety

Zizhao Wang, Dingcheng Li, Vaishakh Keshava et al. · Google · The University of Texas at Austin +2 more

Defends LLM tool-using agents from indirect prompt injection via adversarial RL co-training in a two-player zero-sum game

Prompt Injection nlpreinforcement-learning
3 citations PDF
defense arXiv Oct 2, 2025 · Oct 2025

Detecting Post-generation Edits to Watermarked LLM Outputs via Combinatorial Watermarking

Liyan Xie, Muhammad Siddeek, Mohamed Seif et al. · University of Minnesota · Princeton University +2 more

Combinatorial vocabulary-partitioning watermark for LLM text that detects and localizes post-generation edits and spoofing attacks

Output Integrity Attack nlp
1 citations PDF
attack CCS Oct 2, 2025 · Oct 2025

Evaluating the Robustness of a Production Malware Detection System to Transferable Adversarial Attacks

Milad Nasr, Yanick Fratantonio, Luca Invernizzi et al. · Google DeepMind · OpenAI +2 more

Adversarial 13-byte modification evades Gmail's ML file-type routing model, bypassing the entire production malware detection pipeline

Input Manipulation Attack nlp
1 citations PDF
benchmark arXiv Sep 8, 2025 · Sep 2025

Not All Samples Are Equal: Quantifying Instance-level Difficulty in Targeted Data Poisoning

William Xu, Yiwei Lu, Yihan Wang et al. · University of Waterloo · University of Ottawa +3 more

Introduces three metrics—ergodic prediction accuracy, poison distance, and budget—to predict which test instances are most vulnerable to targeted data poisoning

Data Poisoning Attack vision
PDF
defense arXiv Aug 25, 2025 · Aug 2025

ISACL: Internal State Analyzer for Copyrighted Training Data Leakage

Guangwei Zhang, Qisheng Su, Jiateng Liu et al. · City University of Hong Kong · Microsoft +4 more

Proactive LLM defense inspects internal states pre-generation to intercept copyrighted training data before disclosure

Model Inversion Attack Sensitive Information Disclosure nlp
PDF Code
tool arXiv Aug 21, 2025 · Aug 2025

PickleBall: Secure Deserialization of Pickle-based Machine Learning Models (Extended Report)

Andreas D. Kellas, Neophytos Christou, Wenxin Jiang et al. · Columbia University · Brown University +4 more

Defends against malicious pickle-based ML models on Hugging Face via static analysis and dynamic policy enforcement at load time

AI Supply Chain Attacks
PDF