ML Security Papers

Latest papers

18 papers

attack arXiv Apr 23, 2026 · 28d ago

PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training

Harsh Kumar, Rahul Maity, Tanmay Joshi et al. · Manipal University Jaipur · National Institute of Technology Karnataka +3 more

Web-scale poisoning attack planting dormant backdoor triggers in LLM pretraining corpora via stealth websites indexed by Common Crawl

Data Poisoning Attack Model Poisoning AI Supply Chain Attacks Training Data Poisoning nlp

PDF Code

defense arXiv Apr 21, 2026 · 4w ago

An AI Agent Execution Environment to Safeguard User Data

Robert Stanley, Avi Verma, Lillian Tsai et al. · University of California · Google

Information flow control system for AI agents that blocks prompt injection data exfiltration attacks while enforcing user privacy policies

Prompt Injection Sensitive Information Disclosure Excessive Agency nlp

PDF

survey Transactions on Machine Learni... Mar 30, 2026 · 7w ago

Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey

Bhavuk Jain, Sercan Ö. Arık, Hardeo K. Thakur · Google · Bennett University

Surveys adversarial attacks on multimodal LLMs, organizing threats by attacker objectives and linking attacks to architectural vulnerabilities

Input Manipulation Attack Prompt Injection multimodalnlpvisionaudio

PDF

benchmark arXiv Mar 19, 2026 · 9w ago

MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Masoumeh Shafieinejad, Xi He, Mahshid Alinoori et al. · Vector Institute · University of Waterloo +3 more

Competition evaluating membership inference attack resistance of diffusion models generating synthetic tabular data across white-box and black-box settings

Membership Inference Attack tabulargenerative

PDF Code

defense arXiv Mar 4, 2026 · 11w ago

Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks

Haoyu Liu, Dingcheng Li, Lukas Rutishauser et al. · UC Berkeley · Google +1 more

Defends multimodal web agents against cross-modal DOM injection attacks using adversarial self-play RL across visual and text channels

Prompt Injection Excessive Agency multimodalreinforcement-learning

PDF

defense arXiv Feb 9, 2026 · Feb 2026

Reinforcement Learning with Backtracking Feedback

Bilgehan Sel, Vaishakh Keshava, Phillip Wallis et al. · Google · Virginia Tech +1 more

Trains LLMs to self-correct safety violations mid-generation via RL and a 'backtrack by x tokens' signal, reducing GCG and jailbreak attack success rates

Input Manipulation Attack Prompt Injection nlp

PDF

defense arXiv Feb 8, 2026 · Feb 2026

CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution

Minbeom Kim, Mihir Parmar, Phillip Wallis et al. · Google Cloud AI Research · Seoul National University +2 more

Defends LLM tool-calling agents against indirect prompt injection via causal attribution-based dominance shift detection at privileged action points

Prompt Injection Excessive Agency nlp

PDF

survey IACR ePrint Dec 1, 2025 · Dec 2025

Systems Security Foundations for Agentic Computing

Mihai Christodorescu, Earlence Fernandes, Ashish Hooda et al. · Google · University of California +5 more

Surveys agentic AI security through a systems-security lens, covering prompt injection, tool-use risks, and 11 real-world attack case studies

Prompt Injection Insecure Plugin Design Excessive Agency nlp

3 citations PDF

defense arXiv Nov 24, 2025 · Nov 2025

Re-Key-Free, Risky-Free: Adaptable Model Usage Control

Zihan Wang, Zhongkui Ma, Xinguo Feng et al. · The University of Queensland · CSIRO’s Data61 +3 more

Defends model IP with key-locked weights that survive fine-tuning, keeping unauthorized inference at near-random performance

Model Theft vision

1 citations PDF

Deep neural networks (DNNs) have become valuable intellectual property of model owners, due to the substantial resources required for their development. To protect these assets in the deployed environment, recent research has proposed model usage control mechanisms to ensure models cannot be used without proper authorization. These methods typically lock the utility of the model by embedding an access key into its parameters. However, they often assume static deployment, and largely fail to withstand continual post-deployment model updates, such as fine-tuning or task-specific adaptation. In this paper, we propose ADALOC, to endow key-based model usage control with adaptability during model evolution. It strategically selects a subset of weights as an intrinsic access key, which enables all model updates to be confined to this key throughout the evolution lifecycle. ADALOC enables using the access key to restore the keyed model to the latest authorized states without redistributing the entire network (i.e., adaptation), and frees the model owner from full re-keying after each model update (i.e., lock preservation). We establish a formal foundation to underpin ADALOC, providing crucial bounds such as the errors introduced by updates restricted to the access key. Experiments on standard benchmarks, such as CIFAR-100, Caltech-256, and Flowers-102, and modern architectures, including ResNet, DenseNet, and ConvNeXt, demonstrate that ADALOC achieves high accuracy under significant updates while retaining robust protections. Specifically, authorized usages consistently achieve strong task-specific performance, while unauthorized usage accuracy drops to near-random guessing levels (e.g., 1.01% on CIFAR-100), compared to up to 87.01% without ADALOC. This shows that ADALOC can offer a practical solution for adaptive and protected DNN deployment in evolving real-world scenarios.

cnn The University of Queensland · CSIRO’s Data61 · City University of Hong Kong +2 more

PDF arXiv DOI

defense arXiv Oct 31, 2025 · Oct 2025

Consistency Training Helps Stop Sycophancy and Jailbreaks

Alex Irpan, Alexander Matt Turner, Mark Kurzeja et al. · Google

Defends LLMs against jailbreaks and sycophancy via consistency training, making models invariant to adversarial prompt manipulations

Prompt Injection nlp

PDF

attack arXiv Oct 15, 2025 · Oct 2025

When "Correct" Is Not Safe: Can We Trust Functionally Correct Patches Generated by Code Agents?

Yibo Peng, James Song, Lei Li et al. · Carnegie Mellon University · University of Michigan +3 more

Attacks LLM code agents via crafted issues to produce test-passing but security-vulnerable patches across 12 agent-model combinations

Prompt Injection nlp

PDF

defense SSRN Oct 8, 2025 · Oct 2025

A2AS: Agentic AI Runtime Security and Self-Defense

Eugene Neelou, Ivan Novikov, Max Moroz et al. · A2AS · OWASP +10 more

Proposes A2AS runtime security framework for LLM agents enforcing prompt authentication, behavior boundaries, and in-context defenses

Prompt Injection Excessive Agency nlp

3 citations PDF

defense arXiv Oct 6, 2025 · Oct 2025

Adversarial Reinforcement Learning for Large Language Model Agent Safety

Zizhao Wang, Dingcheng Li, Vaishakh Keshava et al. · Google · The University of Texas at Austin +2 more

Defends LLM tool-using agents from indirect prompt injection via adversarial RL co-training in a two-player zero-sum game

Prompt Injection nlpreinforcement-learning

3 citations PDF

defense arXiv Oct 2, 2025 · Oct 2025

Detecting Post-generation Edits to Watermarked LLM Outputs via Combinatorial Watermarking

Liyan Xie, Muhammad Siddeek, Mohamed Seif et al. · University of Minnesota · Princeton University +2 more

Combinatorial vocabulary-partitioning watermark for LLM text that detects and localizes post-generation edits and spoofing attacks

Output Integrity Attack nlp

1 citations PDF

attack CCS Oct 2, 2025 · Oct 2025

Evaluating the Robustness of a Production Malware Detection System to Transferable Adversarial Attacks

Milad Nasr, Yanick Fratantonio, Luca Invernizzi et al. · Google DeepMind · OpenAI +2 more

Adversarial 13-byte modification evades Gmail's ML file-type routing model, bypassing the entire production malware detection pipeline

Input Manipulation Attack nlp

1 citations PDF

benchmark arXiv Sep 8, 2025 · Sep 2025

Not All Samples Are Equal: Quantifying Instance-level Difficulty in Targeted Data Poisoning

William Xu, Yiwei Lu, Yihan Wang et al. · University of Waterloo · University of Ottawa +3 more

Introduces three metrics—ergodic prediction accuracy, poison distance, and budget—to predict which test instances are most vulnerable to targeted data poisoning

Data Poisoning Attack vision

PDF

defense arXiv Aug 25, 2025 · Aug 2025

ISACL: Internal State Analyzer for Copyrighted Training Data Leakage

Guangwei Zhang, Qisheng Su, Jiateng Liu et al. · City University of Hong Kong · Microsoft +4 more

Proactive LLM defense inspects internal states pre-generation to intercept copyrighted training data before disclosure

Model Inversion Attack Sensitive Information Disclosure nlp

PDF Code

tool arXiv Aug 21, 2025 · Aug 2025

PickleBall: Secure Deserialization of Pickle-based Machine Learning Models (Extended Report)

Andreas D. Kellas, Neophytos Christou, Wenxin Jiang et al. · Columbia University · Brown University +4 more

Defends against malicious pickle-based ML models on Hugging Face via static analysis and dynamic policy enforcement at load time

AI Supply Chain Attacks

PDF

Latest papers

PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training

An AI Agent Execution Environment to Safeguard User Data

Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey

MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks

Reinforcement Learning with Backtracking Feedback

CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution

Systems Security Foundations for Agentic Computing

Re-Key-Free, Risky-Free: Adaptable Model Usage Control

Consistency Training Helps Stop Sycophancy and Jailbreaks

When "Correct" Is Not Safe: Can We Trust Functionally Correct Patches Generated by Code Agents?

A2AS: Agentic AI Runtime Security and Self-Defense

Adversarial Reinforcement Learning for Large Language Model Agent Safety

Detecting Post-generation Edits to Watermarked LLM Outputs via Combinatorial Watermarking

Evaluating the Robustness of a Production Malware Detection System to Transferable Adversarial Attacks

Not All Samples Are Equal: Quantifying Instance-level Difficulty in Targeted Data Poisoning

ISACL: Internal State Analyzer for Copyrighted Training Data Leakage

PickleBall: Secure Deserialization of Pickle-based Machine Learning Models (Extended Report)

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue