Latest papers

10 papers
attack arXiv Mar 31, 2026 · 6d ago

Dummy-Aware Weighted Attack (DAWA): Breaking the Safe Sink in Dummy Class Defenses

Yunrui Yu, Xuxiang Feng, Pengda Qin et al. · Tsinghua University · University of Macau +1 more

Novel adversarial attack targeting dummy-class defenses by simultaneously attacking true and dummy labels with adaptive weighting

Input Manipulation Attack vision
PDF
attack arXiv Mar 4, 2026 · 4w ago

When Safety Becomes a Vulnerability: Exploiting LLM Alignment Homogeneity for Transferable Blocking in RAG

Junchen Li, Chao Qi, Rongzheng Wang et al. · University of Electronic Science and Technology of China · Fudan University +1 more

Poisons RAG knowledge bases with alignment-exploiting documents that transfer blocking attacks across 7 LLMs with 96% success

Data Poisoning Attack Prompt Injection nlp
PDF
defense arXiv Feb 27, 2026 · 5w ago

Your Inference Request Will Become a Black Box: Confidential Inference for Cloud-based Large Language Models

Chung-ju Huang, Huiqiang Zhao, Yuanpeng He et al. · Peking University · Tencent +1 more

Defends LLM client prompts from cloud-provider reconstruction via CVM partitioning and reversible masking, cutting token inference accuracy from 97.5% to 1.34%

Model Inversion Attack Sensitive Information Disclosure nlp
PDF
benchmark arXiv Feb 4, 2026 · 8w ago

The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment

Zhexin Zhang, Yida Lu, Junfeng Fang et al. · Tsinghua University · National University of Singapore +1 more

First systematic taxonomy of training-time implicit safety risks in RL-trained LLMs, showing risky behaviors in 74.4% of runs

Model Skewing Excessive Agency nlp reinforcement-learning
PDF
benchmark arXiv Jan 9, 2026 · 12w ago

FinVault: Benchmarking Financial Agent Safety in Execution-Grounded Environments

Zhi Yang, Runguo Li, Qiqi Qiang et al. · Shanghai University of Finance and Economics · The Chinese University of Hong Kong +8 more

Benchmarks prompt injection and jailbreak attacks on LLM financial agents in execution-grounded, state-writable sandbox environments

Prompt Injection Excessive Agency nlp
PDF Code
defense arXiv Dec 7, 2025 · Dec 2025

AlignGemini: Generalizable AI-Generated Image Detection Through Task-Model Alignment

Ruoxin Chen, Jiahui Gao, Kaiqing Lin et al. · Tencent · East China University of Science and Technology +2 more

Proposes task-model alignment combining VLMs and vision models for generalizable AI-generated image detection

Output Integrity Attack vision multimodal
PDF
attack arXiv Nov 24, 2025 · Nov 2025

Medusa: Cross-Modal Transferable Adversarial Attacks on Multimodal Medical Retrieval-Augmented Generation

Yingjia Shang, Yi Liu, Huimin Wang et al. · Westlake University · Heilongjiang University +2 more

Black-box adversarial visual perturbations hijack retrieval in medical VLM-RAG systems, achieving 90%+ attack success via multi-positive InfoNCE loss and IRM-augmented optimization

Input Manipulation Attack Prompt Injection vision multimodal nlp
1 citation PDF Code
survey arXiv Oct 27, 2025 · Oct 2025

MCPGuard: Automatically Detecting Vulnerabilities in MCP Servers

Bin Wang, Zexin Liu, Hao Yu et al. · Peking University · Tencent

Surveys and systematically classifies MCP server security threats, covering tool poisoning, web exploits, and supply chain risks, and presents the MCPGuard detection framework

AI Supply Chain Attacks Insecure Plugin Design Prompt Injection nlp
9 citations 1 influential PDF
attack arXiv Oct 21, 2025 · Oct 2025

Genesis: Evolving Attack Strategies for LLM Web Agent Red-Teaming

Zheng Zhang, Jiarui He, Yuchen Cai et al. · The Hong Kong University of Science and Technology · Tencent +2 more

Evolves indirect prompt injection attacks against LLM web agents using genetic algorithms and a growing strategy library

Prompt Injection Excessive Agency nlp
PDF
attack arXiv Sep 20, 2025 · Sep 2025

Can an Individual Manipulate the Collective Decisions of Multi-Agents?

Fengyuan Liu, Rui Zhao, Shuo Chen et al. · Tencent · University of Oxford +3 more

Attacks multi-agent LLM systems using optimized adversarial suffixes, misleading collective decisions with access to only one agent

Input Manipulation Attack Prompt Injection nlp
PDF Code