Latest papers

118 papers
defense arXiv Apr 6, 2026 · 2d ago

ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems

Zhuowen Yuan, Zhaorun Chen, Zhen Xiang et al. · University of Illinois Urbana-Champaign · Virtue AI +6 more

Network-level guardrail that detects supply-chain poisoning in LLM agent MCP tools via a MITM proxy monitoring their network behavior (minimal proxy sketch below)

AI Supply Chain Attacks Insecure Plugin Design nlp
PDF
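The core mechanism here is a MITM proxy watching what each tool actually contacts on the network. Below is a minimal mitmproxy sketch of that monitoring pattern only, not ShieldNet's actual detector; the `x-mcp-tool` header, hostnames, and allowlist are hypothetical.

```python
# Minimal mitmproxy addon sketching network-level tool monitoring.
# Illustrative only, not ShieldNet's detector. The x-mcp-tool header,
# hostnames, and allowlist below are hypothetical.
from mitmproxy import http

# Hosts each MCP tool is expected to contact (hypothetical allowlist).
EXPECTED_HOSTS = {
    "weather-tool": {"api.weather.example.com"},
    "search-tool": {"search.example.com"},
}

class ToolEgressGuard:
    def request(self, flow: http.HTTPFlow) -> None:
        # Assume the agent runtime tags outbound calls with the tool name.
        tool = flow.request.headers.get("x-mcp-tool", "unknown")
        if flow.request.pretty_host not in EXPECTED_HOSTS.get(tool, set()):
            # Refuse the unexpected egress instead of forwarding it.
            flow.response = http.Response.make(
                403, b"blocked: unexpected egress for tool", {}
            )

addons = [ToolEgressGuard()]
```

Run with `mitmproxy -s guard.py` and route the agent's outbound traffic through the proxy.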
benchmark arXiv Apr 4, 2026 · 4d ago

Your Agent is More Brittle Than You Think: Uncovering Indirect Injection Vulnerabilities in Agentic LLMs

Wenhui Zhu, Xuanzhao Dong, Xiwen Chen et al. · Arizona State University · Morgan Stanley +5 more

Evaluates indirect prompt injection attacks on LLM agents across a range of defenses, finding that most fail while RepE-based circuit breakers achieve robust detection (toy probe sketch below)

Prompt Injection Insecure Plugin Design Excessive Agency nlp multimodal
PDF
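The standout defense in this benchmark is RepE-based. As a rough illustration of how representation engineering reads harmful intent from hidden states, here is a toy difference-of-means probe on synthetic activations; real circuit breakers additionally reroute the model's representations, which this sketch does not attempt.

```python
# Toy difference-of-means probe in the spirit of representation engineering.
# All activations are synthetic; a real probe would read LLM hidden states.
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden size

# Synthetic hidden states for harmful vs. benign prompts.
harmful = rng.normal(0.5, 1.0, size=(100, d))
benign = rng.normal(0.0, 1.0, size=(100, d))

# Reading vector: difference of class means, normalized.
direction = harmful.mean(axis=0) - benign.mean(axis=0)
direction /= np.linalg.norm(direction)

# Calibrate a threshold on benign projections, then score new activations.
threshold = np.quantile(benign @ direction, 0.99)
flagged = (harmful @ direction > threshold).mean()
print(f"toy detection rate at 1% FPR: {flagged:.2f}")
```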
defense arXiv Apr 4, 2026 · 4d ago

SecPI: Secure Code Generation with Reasoning Models via Security Reasoning Internalization

Hao Wang, Niels Mündler, Mark Vero et al. · University of California · ETH Zürich

Fine-tunes reasoning LLMs to internalize security reasoning, generating secure code by default without explicit security prompts

Prompt Injection nlp
PDF
attack arXiv Apr 1, 2026 · 7d ago

Out of Sight, Out of Track: Adversarial Attacks on Propagation-based Multi-Object Trackers via Query State Manipulation

Halima Bouzidi, Haoyu Liu, Yonatan Gizachew Achamyeleh et al. · University of California

Adversarial attacks on multi-object trackers that flood query budgets and corrupt temporal memory to force track terminations

Input Manipulation Attack vision
PDF
benchmark arXiv Apr 1, 2026 · 7d ago

Cooking Up Risks: Benchmarking and Reducing Food Safety Risks in Large Language Models

Weidi Luo, Xiaofei Wen, Tenghao Huang et al. · University of Georgia · University of California +3 more

Benchmark and guardrail for detecting jailbreak attacks that bypass LLM safety alignment in the food safety domain

Prompt Injection nlp
PDF Code
defense arXiv Mar 30, 2026 · 9d ago

Lipschitz verification of neural networks through training

Simon Kuang, Yuezhu Xu, S. Sivaranjani et al. · University of California · Purdue University

Trains certifiably robust neural networks by penalizing the trivial Lipschitz bound during training, achieving tight provable robustness guarantees (penalty-term sketch below)

Input Manipulation Attack vision
PDF
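The trivial Lipschitz bound of a feedforward network is the product of its layers' spectral norms, so penalizing its logarithm reduces to a differentiable sum of log spectral norms. A minimal PyTorch sketch of that penalty term, assuming a plain MLP and a hypothetical penalty weight; the paper's certified training procedure is more involved.

```python
# PyTorch sketch: penalize the log of the trivial Lipschitz bound
# (product of per-layer spectral norms). Penalty weight is hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
lam = 1e-2  # hypothetical penalty weight

def log_lipschitz_bound(m: nn.Sequential) -> torch.Tensor:
    # log prod_i ||W_i||_2 = sum_i log ||W_i||_2 (largest singular values).
    logs = [torch.log(torch.linalg.matrix_norm(layer.weight, ord=2))
            for layer in m if isinstance(layer, nn.Linear)]
    return torch.stack(logs).sum()

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = F.cross_entropy(model(x), y) + lam * log_lipschitz_bound(model)
opt.zero_grad()
loss.backward()
opt.step()
```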
benchmark arXiv Mar 12, 2026 · 27d ago

You Told Me to Do It: Measuring Instructional Text-induced Private Data Leakage in LLM Agents

Ching-Yu Kao, Xinfeng Li, Shenyu Dai et al. · Fraunhofer AISEC · Nanyang Technological University +3 more

Benchmarks documentation-embedded indirect prompt injection against high-privilege LLM agents, achieving 85% exfiltration success with 0% human detection rate

Prompt Injection Excessive Agency nlp
PDF
defense arXiv Mar 12, 2026 · 27d ago

Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment

Zhiyu Xue, Zimo Qi, Guangliang Liu et al. · University of California · Johns Hopkins University +2 more

Analyzes refusal trigger mechanisms in LLM safety alignment to reduce overrefusal while maintaining jailbreak defenses

Prompt Injection nlp
PDF
survey arXiv Mar 11, 2026 · 28d ago

The Attack and Defense Landscape of Agentic AI: A Comprehensive Survey

Juhee Kim, Xiaoyuan Liu, Zhun Wang et al. · University of California · Seoul National University +1 more

Surveys attacks and defenses across agentic LLM systems, covering prompt injection, insecure tool use, and excessive agency risks

Prompt Injection Insecure Plugin Design Excessive Agency nlp multimodal
PDF
defense arXiv Mar 2, 2026 · 5w ago

Authenticated Contradictions from Desynchronized Provenance and Watermarking

Alexander Nemecek, Hengzhi He, Guang Cheng et al. · Case Western Reserve University · University of California

Exposes a provenance-watermark desynchronization vulnerability that yields cryptographically valid AI-generated 'authenticated fakes', and proposes a cross-layer audit protocol as the defense

Output Integrity Attack vision generative
PDF
attack arXiv Mar 1, 2026 · 5w ago

Clawdrain: Exploiting Tool-Calling Chains for Stealthy Token Exhaustion in OpenClaw Agents

Ben Dong, Hui Feng, Qian Wang · University of California

Trojanized LLM agent skill exploits tool-calling loops to achieve 6–9x token amplification in production OpenClaw deployments (budget-guard sketch below)

Model Denial of Service Insecure Plugin Design nlp
PDF
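A natural defense-side counterpart (not from the paper) is a per-session budget that caps cumulative tokens and tool-call chain length, which would bound this kind of amplification. A toy sketch with hypothetical limits:

```python
# Toy per-session guard bounding cumulative tokens and tool-call chain
# length. Names and limits are hypothetical, not from the paper.
class TokenBudgetGuard:
    def __init__(self, max_tokens: int = 50_000, max_tool_calls: int = 20):
        self.max_tokens = max_tokens
        self.max_tool_calls = max_tool_calls
        self.tokens_used = 0
        self.tool_calls = 0

    def charge(self, n_tokens: int, is_tool_call: bool = False) -> None:
        """Record usage; raise once a trojanized loop blows the budget."""
        self.tokens_used += n_tokens
        self.tool_calls += int(is_tool_call)
        if self.tokens_used > self.max_tokens:
            raise RuntimeError("session token budget exhausted")
        if self.tool_calls > self.max_tool_calls:
            raise RuntimeError("tool-call chain too long; possible loop")
```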
defense arXiv Feb 28, 2026 · 5w ago

ROKA: Robust Knowledge Unlearning against Adversaries

Jinmyeong Shin, Joshua Tapia, Nicholas Ferreira et al. · University of California · California State University

Proposes ROKA, a defense against adversarial unlearning attacks that weaponize knowledge contamination to corrupt security-critical model predictions without manipulating data

Model Skewing Model Poisoning vision nlp multimodal
PDF
defense arXiv Feb 26, 2026 · 5w ago

IMMACULATE: A Practical LLM Auditing Framework via Verifiable Computation

Yanpei Guo, Wenjie Qu, Linyu Wu et al. · National University of Singapore · Nanyang Technological University +1 more

Auditing framework using verifiable computation to detect LLM provider fraud (model substitution, quantization abuse, token overbilling) with under 1% overhead

Output Integrity Attack nlp
PDF Code
survey arXiv Feb 23, 2026 · 6w ago

Agentic AI as a Cybersecurity Attack Surface: Threats, Exploits, and Defenses in Runtime Supply Chains

Xiaochong Jiang, Shiqi Yang, Wenting Yang et al. · Northeastern University · New York University +2 more

Surveys runtime attack surfaces of agentic LLM systems, introducing the Viral Agent Loop self-propagating worm and a Zero-Trust defense architecture

Prompt Injection Insecure Plugin Design Excessive Agency nlp
PDF
defense arXiv Feb 22, 2026 · 6w ago

TokenTrace: Multi-Concept Attribution through Watermarked Token Recovery

Li Zhang, Shruti Agarwal, John Collomosse et al. · University of California · Adobe

Proactive multi-concept watermarking for diffusion models enabling independent IP attribution of styles and objects from generated images

Output Integrity Attack vision generative
PDF
defense arXiv Feb 19, 2026 · 6w ago

Towards Anytime-Valid Statistical Watermarking

Baihe Huang, Eric Xu, Kannan Ramchandran et al. · University of California

Proposes e-value-based LLM text watermarking with anytime-valid stopping, cutting the detection token budget by 13–15% (stopping-rule sketch below)

Output Integrity Attack nlp
PDF
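The anytime-valid idea: multiply per-token e-values and stop as soon as the running product crosses 1/α, with Ville's inequality controlling the type-I error at any stopping time. A toy sketch under a standard-normal null with a placeholder e-value, not the paper's construction:

```python
# Toy anytime-valid watermark test: multiply per-token e-values and stop
# when the product reaches 1/alpha. The e-value below assumes a standard-
# normal null score; it is a placeholder, not the paper's construction.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.01
lam = 0.5

def e_value(score: float) -> float:
    # Under S ~ N(0, 1): E[exp(lam * S - lam**2 / 2)] = 1, a valid e-value.
    return float(np.exp(lam * score - lam**2 / 2))

# Toy model: watermarked text yields positively shifted per-token scores.
scores = rng.normal(0.8, 1.0, size=1000)

e_process = 1.0
for t, s in enumerate(scores, start=1):
    e_process *= e_value(s)
    if e_process >= 1 / alpha:  # Ville's inequality: valid at any stop time
        print(f"watermark detected after {t} tokens")
        break
```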
attack arXiv Feb 18, 2026 · 7w ago

Narrow fine-tuning erodes safety alignment in vision-language agents

Idhant Gulati, Shivam Raval · University of California · Harvard University

LoRA fine-tuning VLMs on narrow harmful datasets causes emergent safety misalignment that generalizes across modalities, with multimodal evaluation revealing 70% misalignment at rank 128

Transfer Learning Attack Prompt Injection multimodal vision nlp
PDF
benchmark arXiv Feb 13, 2026 · 7w ago

Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents

Xu Li, Simon Yu, Minzhou Pan et al. · Northeastern University · Virtue AI +2 more

Benchmarks multi-turn jailbreaks in tool-using LLM agents and proposes ToolShield, a self-exploration defense reducing attack success rate (ASR) by 30%

Prompt Injection Insecure Plugin Design nlp
PDF Code
defense arXiv Feb 12, 2026 · 7w ago

Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models

Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis · University of California

Detects and disrupts LLM jailbreaks at inference time using tensor decomposition of internal layer activations (toy decomposition sketch below)

Prompt Injection nlp
PDF
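As a rough illustration of the approach, the sketch below stacks per-layer activations into a prompt × layer × hidden tensor, CP-decomposes it with tensorly, and checks whether any prompt-mode component separates jailbreaks from benign prompts. Everything here is synthetic, and the paper's exact pipeline differs.

```python
# Toy version of the detection idea: stack per-layer activations into a
# (prompt, layer, hidden) tensor, CP-decompose it, and look for a prompt-
# mode component that separates jailbreaks. Synthetic data throughout.
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(0)
n_prompts, n_layers, d = 60, 8, 32

acts = rng.normal(size=(n_prompts, n_layers, d))
labels = np.array([0] * 30 + [1] * 30)
acts[labels == 1] += 0.7  # jailbreak prompts get a shifted component

weights, factors = parafac(tl.tensor(acts), rank=4)
prompt_factors = factors[0]  # one row of CP loadings per prompt

best = 0.0
for k in range(prompt_factors.shape[1]):
    comp = prompt_factors[:, k]
    preds = (comp > np.median(comp)).astype(int)
    acc = max((preds == labels).mean(), ((1 - preds) == labels).mean())
    best = max(best, acc)
print(f"best single-component separation on toy data: {best:.2f}")
```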
defense arXiv Feb 10, 2026 · 8w ago

Statistical Roughness-Informed Machine Unlearning

Mohammad Partohaghighi, Roummel Marcia, Bruce J. West et al. · University of California · North Carolina State University

Spectral-stability-weighted machine unlearning algorithm that concentrates forgetting in stable layers, evaluated against membership inference leakage (toy MIA sketch below)

Membership Inference Attack
PDF
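For context on the evaluation, a loss-threshold membership inference test (in the style of Yeom et al.) is the simplest way to check whether "forgotten" examples still look like training members. A toy sketch on synthetic losses; the paper's evaluation is more elaborate.

```python
# Toy loss-threshold membership inference: after good unlearning,
# forget-set losses should resemble non-member losses. Synthetic data.
import numpy as np

rng = np.random.default_rng(0)
member_losses = rng.gamma(2.0, 0.3, size=500)     # members: lower loss
nonmember_losses = rng.gamma(2.0, 0.6, size=500)  # non-members: higher loss

tau = np.median(np.concatenate([member_losses, nonmember_losses]))
tpr = (member_losses < tau).mean()     # members flagged as members
fpr = (nonmember_losses < tau).mean()  # non-members flagged as members
print(f"MIA advantage: {tpr - fpr:.2f} (near 0 means little leakage)")
```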