ML Security Papers

Latest papers

12 papers

attack arXiv Apr 20, 2026

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

Yuan Fang, Yiming Luo, Aimin Zhou et al. · East China Normal University · Shanghai Innovation Institute

Automated red-teaming framework generating diverse toxic datasets via inverted constitutional AI to test LLM safety mechanisms

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF Code

attack arXiv Mar 25, 2026

Invisible Threats from Model Context Protocol: Generating Stealthy Injection Payload via Tree-based Adaptive Search

Yulin Shen, Xudong Pan, Geng Hong et al. · Fudan University · Shanghai Innovation Institute

Black-box tree-search attack generating stealthy injection payloads that hijack MCP-enabled LLM agents through manipulated tool responses

Prompt Injection Insecure Plugin Design nlp

PDF

survey arXiv Mar 2, 2026

From Secure Agentic AI to Secure Agentic Web: Challenges, Threats, and Future Directions

Zhihang Deng, Jiaping Gui, Weinan Zhang · Shanghai Innovation Institute · Shanghai Jiao Tong University

Surveys prompt injection, toolchain abuse, and agent network threats across LLM agentic systems and web-scale deployments

Prompt Injection Insecure Plugin Design Excessive Agency nlp

PDF

defense arXiv Jan 21, 2026

INFA-Guard: Mitigating Malicious Propagation via Infection-Aware Safeguarding in LLM-Based Multi-Agent Systems

Yijin Zhou, Xiaoya Lu, Dongrui Liu et al. · Shanghai Jiao Tong University · Shanghai Artificial Intelligence Laboratory +1 more

Defends LLM multi-agent systems from viral malicious propagation by detecting and rehabilitating infected agents with topological constraints

Prompt Injection Excessive Agency nlp

PDF Code

defense arXiv Jan 19, 2026

MirrorGuard: Toward Secure Computer-Use Agents via Simulation-to-Real Reasoning Correction

Wenqi Zhang, Yulin Shen, Changyue Jiang et al. · Fudan University · Shanghai Innovation Institute

Defends LLM computer-use agents against prompt/visual injection by training on simulated unsafe GUI trajectories to correct reasoning chains

Prompt Injection Excessive Agency nlp vision multimodal

PDF Code

benchmark arXiv Jan 15, 2026

A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5

Xingjun Ma, Yixu Wang, Hengyuan Xu et al. · Fudan University · Shanghai Innovation Institute +2 more

Benchmarks six frontier LLMs/VLMs on adversarial, multilingual, and compliance safety, revealing that all of them fall below a 6% worst-case safety rate

Prompt Injection nlp multimodal vision generative

1 citation PDF

benchmark arXiv Jan 13, 2026

WebTrap Park: An Automated Platform for Systematic Security Evaluation of Web Agents

Xinyi Wu, Jiagui Chen, Geng Hong et al. · Fudan University · Shanghai Innovation Institute

Automated benchmark with 1,226 tasks evaluating LLM web agent security across prompt injection and excessive agency risks

Prompt Injection Excessive Agency nlp

PDF Code

defense arXiv Jan 12, 2026

When Bots Take the Bait: Exposing and Mitigating the Emerging Social Engineering Attack in Web Automation Agent

Xinyi Wu, Geng Hong, Yueyue Chen et al. · Fudan University · Zhongguancun Laboratory +2 more

Shows that social engineering attacks can hijack LLM web agents via malicious webpage content; proposes a runtime defense that reduces attack success by 78%

Prompt Injection Excessive Agency nlp

1 citation PDF

benchmark arXiv Jan 8, 2026

BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents

Yunhao Feng, Yige Li, Yutao Wu et al. · Fudan University · Alibaba Group +4 more

Benchmark framework systematizing backdoor attacks across planning, memory, and tool-use stages of LLM agent workflows

Model Poisoning Excessive Agency nlp multimodal

1 citation PDF Code

defense arXiv Nov 29, 2025

SAIDO: Generalizable Detection of AI-Generated Images via Scene-Aware and Importance-Guided Dynamic Optimization in Continual Learning

Yongkang Hu, Yu Cheng, Yushuo Zhang et al. · East China Normal University · Shanghai Innovation Institute

Continual-learning detection framework for AI-generated images using scene-aware expert modules and gradient-projection to prevent forgetting

Output Integrity Attack vision

PDF

tool arXiv Oct 10, 2025

Provable Training Data Identification for Large Language Models

Zhenlong Liu, Hao Zeng, Weiran Huang et al. · Southern University of Science and Technology · Shanghai Innovation Institute +1 more

Set-level membership inference for LLMs with provable false identification rate control via conformal p-values and BH procedure

Membership Inference Attack nlp

PDF

defense arXiv Oct 8, 2025

AWM: Accurate Weight-Matrix Fingerprint for Large Language Models

Boyi Zeng, Lin Chen, Ziwei He et al. · Shanghai Jiao Tong University · Fudan University +1 more

Training-free LLM weight-matrix fingerprinting detects model lineage with perfect AUC, robust to six post-training modification types

Model Theft Model Theft nlp

PDF Code

