ML Security Papers

Latest papers

11 papers

defense arXiv Apr 11, 2026 · 5w ago

STARS: Skill-Triggered Audit for Request-Conditioned Invocation Safety in Agent Systems

Guijia Zhang, Shu Yang, Xilin Gong et al. · Shenzhen University · King Abdullah University of Science & Technology +2 more

Runtime risk-scoring system for LLM agent tool calls that detects indirect prompt injection attacks before execution

Prompt Injection Insecure Plugin Design Excessive Agency nlp

PDF Code

attack arXiv Apr 7, 2026 · 6w ago

FedSpy-LLM: Towards Scalable and Generalizable Data Reconstruction Attacks from Gradients on LLMs

Syed Irfan Ali Meerza, Feiyi Wang, Jian Liu · University of Tennessee · Oak Ridge National Laboratory +1 more

Gradient-based attack reconstructing training data from federated LLMs at scale, working across architectures and PEFT methods

Model Inversion Attack Sensitive Information Disclosure nlp federated-learning

PDF

defense arXiv Apr 6, 2026 · 6w ago

ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems

Zhuowen Yuan, Zhaorun Chen, Zhen Xiang et al. · University of Illinois Urbana-Champaign · Virtue AI +6 more

Network-level guardrail that detects supply-chain poisoning in LLM agent MCP tools by monitoring network behavior through a MITM proxy

AI Supply Chain Attacks Insecure Plugin Design nlp

PDF

benchmark arXiv Apr 1, 2026 · 7w ago

Cooking Up Risks: Benchmarking and Reducing Food Safety Risks in Large Language Models

Weidi Luo, Xiaofei Wen, Tenghao Huang et al. · University of Georgia · University of California +3 more

Benchmark and guardrail for detecting jailbreak attacks that bypass LLM safety alignment in the food safety domain

Prompt Injection nlp

PDF Code

attack arXiv Feb 28, 2026 · 11w ago

Roots Beneath the Cut: Uncovering the Risk of Concept Revival in Pruning-Based Unlearning for Diffusion Models

Ci Zhang, Zhaojun Ding, Chence Yang et al. · University of Georgia · Carnegie Mellon University +3 more

Attacks pruning-based unlearning in diffusion models by reviving erased concepts via side-channel signals from zeroed weight locations

Output Integrity Attack generative vision

PDF

attack arXiv Jan 30, 2026 · Jan 2026

Whispers of Wealth: Red-Teaming Google's Agent Payments Protocol via Prompt Injection

Tanusree Debi, Wentian Zhu · University of Georgia

Red-teams Google's AP2 payment protocol via prompt injection attacks that hijack agent purchasing decisions and extract sensitive user payment data

Prompt Injection Sensitive Information Disclosure Red-Team Agents nlp

PDF

defense arXiv Jan 13, 2026 · Jan 2026

Q-realign: Piggybacking Realignment on Quantization for Safe and Efficient LLM Deployment

Qitao Tan, Xiaoying Song, Ningxi Cheng et al. · University of Georgia · University of North Texas +2 more

Uses post-training quantization to recover LLM safety alignment eroded by fine-tuning, without retraining, in 40 minutes on one GPU

Transfer Learning Attack Prompt Injection nlp

PDF Code

attack arXiv Dec 18, 2025 · Dec 2025

MemoryGraft: Persistent Compromise of LLM Agents via Poisoned Experience Retrieval

Saksham Sahai Srivastava, Haoyu He · University of Georgia

Poisons LLM agent episodic memory via benign documents, causing persistent unsafe imitation of grafted experience records at retrieval time

Data Poisoning Attack Prompt Injection nlp

4 citations PDF Code

attack arXiv Oct 11, 2025 · Oct 2025

MetaBreak: Jailbreaking Online LLM Services via Special Token Manipulation

Wentian Zhu, Zhen Xiang, Wei Niu et al. · University of Georgia

Exploits LLM special tokens to construct jailbreak primitives that bypass both safety alignment and content moderation simultaneously

Prompt Injection nlp

PDF

benchmark arXiv Oct 8, 2025 · Oct 2025

Code Agent can be an End-to-end System Hacker: Benchmarking Real-world Threats of Computer-use Agent

Weidi Luo, Qiming Zhang, Tianyu Lu et al. · University of Georgia · University of Wisconsin–Madison +6 more

Benchmarks LLM-powered agents' ability to execute end-to-end enterprise intrusions aligned with MITRE ATT&CK TTPs

Excessive Agency Prompt Injection nlp multimodal

4 citations PDF Code

defense arXiv Aug 19, 2025 · Aug 2025

Two Birds with One Stone: Multi-Task Detection and Attribution of LLM-Generated Text

Zixin Rao, Youssef Mohamed, Shang Liu et al. · University of Georgia · Egypt-Japan University of Science and Technology +1 more

Multi-task framework jointly detects LLM-generated text and attributes authorship to specific LLMs across languages

Output Integrity Attack nlp

PDF Code
