Latest papers

10 papers
attack arXiv Mar 31, 2026 · 6d ago

Dummy-Aware Weighted Attack (DAWA): Breaking the Safe Sink in Dummy Class Defenses

Yunrui Yu, Xuxiang Feng, Pengda Qin et al. · Tsinghua University · University of Macau +1 more

Novel adversarial attack targeting dummy-class defenses by simultaneously attacking true and dummy labels with adaptive weighting

Input Manipulation Attack vision
PDF
attack arXiv Mar 4, 2026 · 4w ago

When Safety Becomes a Vulnerability: Exploiting LLM Alignment Homogeneity for Transferable Blocking in RAG

Junchen Li, Chao Qi, Rongzheng Wang et al. · University of Electronic Science and Technology of China · Fudan University +1 more

Poisons RAG knowledge bases with alignment-exploiting documents that transfer blocking attacks across 7 LLMs with 96% success

Data Poisoning Attack Prompt Injection nlp
PDF
defense arXiv Feb 27, 2026 · 5w ago

Your Inference Request Will Become a Black Box: Confidential Inference for Cloud-based Large Language Models

Chung-ju Huang, Huiqiang Zhao, Yuanpeng He et al. · Peking University · Tencent +1 more

Defends LLM client prompts from cloud-provider reconstruction via CVM partitioning and reversible masking, cutting token inference accuracy from 97.5% to 1.34%

Model Inversion Attack Sensitive Information Disclosure nlp
PDF
benchmark arXiv Feb 4, 2026 · 8w ago

The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment

Zhexin Zhang, Yida Lu, Junfeng Fang et al. · Tsinghua University · National University of Singapore +1 more

First systematic taxonomy of training-time implicit safety risks in RL-trained LLMs, showing risky behaviors in 74.4% of runs

Model Skewing Excessive Agency nlp reinforcement-learning
PDF
benchmark arXiv Jan 9, 2026 · 12w ago

FinVault: Benchmarking Financial Agent Safety in Execution-Grounded Environments

Zhi Yang, Runguo Li, Qiqi Qiang et al. · Shanghai University of Finance and Economics · The Chinese University of Hong Kong +8 more

Benchmarks prompt injection and jailbreak attacks on LLM financial agents in execution-grounded, state-writable sandbox environments

Prompt Injection Excessive Agency nlp
PDF Code
defense arXiv Dec 7, 2025 · Dec 2025

AlignGemini: Generalizable AI-Generated Image Detection Through Task-Model Alignment

Ruoxin Chen, Jiahui Gao, Kaiqing Lin et al. · Tencent · East China University of Science and Technology +2 more

Proposes task-model alignment combining VLMs and vision models for generalizable AI-generated image detection

Output Integrity Attack vision multimodal
PDF
attack arXiv Nov 24, 2025 · Nov 2025

Medusa: Cross-Modal Transferable Adversarial Attacks on Multimodal Medical Retrieval-Augmented Generation

Yingjia Shang, Yi Liu, Huimin Wang et al. · Westlake University · Heilongjiang University +2 more

Black-box adversarial visual perturbations hijack retrieval in medical VLM-RAG systems, achieving 90%+ attack success via multi-positive InfoNCE loss and IRM-augmented optimization

Input Manipulation Attack Prompt Injection vision multimodal nlp
1 citation PDF Code
survey arXiv Oct 27, 2025 · Oct 2025

MCPGuard: Automatically Detecting Vulnerabilities in MCP Servers

Bin Wang, Zexin Liu, Hao Yu et al. · Peking University · Tencent

Surveys and systematically classifies MCP server security threats, covering tool poisoning, web exploits, and supply chain risks, and presents the MCPGuard detection framework

AI Supply Chain Attacks Insecure Plugin Design Prompt Injection nlp
9 citations 1 influential PDF
attack arXiv Oct 21, 2025 · Oct 2025

Genesis: Evolving Attack Strategies for LLM Web Agent Red-Teaming

Zheng Zhang, Jiarui He, Yuchen Cai et al. · The Hong Kong University of Science and Technology · Tencent +2 more

Evolves indirect prompt injection attacks against LLM web agents using genetic algorithms and a growing strategy library

Prompt Injection Excessive Agency nlp
PDF
attack arXiv Sep 20, 2025 · Sep 2025

Can an Individual Manipulate the Collective Decisions of Multi-Agents?

Fengyuan Liu, Rui Zhao, Shuo Chen et al. · Tencent · University of Oxford +3 more

Attacks multi-agent LLM systems using optimized adversarial suffixes, misleading collective decisions with access to only one agent

Input Manipulation Attack Prompt Injection nlp
PDF Code