ML Security Papers

Latest papers

75 papers

attack arXiv Apr 18, 2026 · 4w ago

When Choices Become Risks: Safety Failures of Large Language Models under Multiple-Choice Constraints

Yuheng Chen, Zhiyu Wu, Bowen Cheng et al. · Kagoshima University · Fudan University +1 more

Bypasses LLM safety alignment by reformulating harmful prompts as forced-choice questions where all options violate policies

Prompt Injection nlp

PDF

defense arXiv Apr 9, 2026 · 6w ago

FlowGuard: Towards Lightweight In-Generation Safety Detection for Diffusion Models via Linear Latent Decoding

Jinghan Yang, Yihe Fan, Xudong Pan et al. · Fudan University

Detects unsafe NSFW content during diffusion model generation via linear latent decoding, enabling early stopping with 97% memory reduction

Output Integrity Attack visiongenerative

PDF

attack arXiv Apr 8, 2026 · 6w ago

SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

Yunhao Feng, Yifan Ding, Yingshui Tan et al. · National University of Defense Technology · Alibaba Group +2 more

Backdoor attack embedding encrypted malicious payloads in agent skills, activated by triggers during skill composition

Model Poisoning Excessive Agency nlp

PDF

defense arXiv Apr 2, 2026 · 7w ago

From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers

Yiheng Huang, Zhijia Zhao, Bihuan Chen et al. · Fudan University

Constructs dataset of 114 malicious MCP servers exploiting LLM tool-calling and proposes behavioral deviation detector achieving 94.6% F1

Insecure Plugin Design nlp

PDF

attack arXiv Apr 1, 2026 · 7w ago

Adversarial Attenuation Patch Attack for SAR Object Detection

Yiming Zhang, Weibo Qin, Feng Wang · Fudan University

Adversarial patch attack on SAR target detection achieving stealthiness and physical realizability through energy-constrained optimization

Input Manipulation Attack vision

PDF Code

attack arXiv Mar 25, 2026 · 8w ago

Invisible Threats from Model Context Protocol: Generating Stealthy Injection Payload via Tree-based Adaptive Search

Yulin Shen, Xudong Pan, Geng Hong et al. · Fudan University · Shanghai Innovation Institute

Black-box tree-search attack generating stealthy injection payloads that hijack MCP-enabled LLM agents through manipulated tool responses

Prompt Injection Insecure Plugin Design nlp

PDF

tool arXiv Mar 23, 2026 · 8w ago

VIGIL: Part-Grounded Structured Reasoning for Generalizable Deepfake Detection

Xinghan Li, Junhao Xu, Jingjing Chen · Fudan University

Interpretable deepfake detector using multimodal LLMs with part-grounded forensic reasoning and structured evidence verification

Output Integrity Attack visionmultimodalgenerative

PDF Code

benchmark arXiv Mar 8, 2026 · 10w ago

Backdoor4Good: Benchmarking Beneficial Uses of Backdoors in LLMs

Yige Li, Wei Zhao, Zhe Li et al. · Singapore Management University · The University of Melbourne +1 more

Benchmarks beneficial uses of LLM backdoors for safety enforcement, access control, and watermarking via trigger conditioning

Model Poisoning Prompt Injection nlp

PDF Code

attack arXiv Mar 4, 2026 · 11w ago

When Safety Becomes a Vulnerability: Exploiting LLM Alignment Homogeneity for Transferable Blocking in RAG

Junchen Li, Chao Qi, Rongzheng Wang et al. · University of Electronic Science and Technology of China · Fudan University +1 more

Poisons RAG knowledge bases with alignment-exploiting documents that transfer blocking attacks across 7 LLMs with 96% success

Data Poisoning Attack Prompt Injection nlp

PDF

defense arXiv Mar 2, 2026 · 11w ago

RA-Det: Towards Universal Detection of AI-Generated Images via Robustness Asymmetry

Xinchang Wang, Yunhao Chen, Yuechen Zhang et al. · Jiangnan University · Fudan University

Detects AI-generated images by exploiting feature drift asymmetry between real and synthetic images under structured perturbations

Output Integrity Attack vision

PDF Code

benchmark arXiv Feb 25, 2026 · 12w ago

Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models

Zheyuan Gu, Qingsong Zhao, Yusong Wang et al. · China Telecom · Peking University +1 more

Proposes FAQ benchmark to evaluate VLMs on temporal deepfake detection via three-level forensic reasoning hierarchy

Output Integrity Attack visionmultimodal

PDF

defense arXiv Feb 25, 2026 · 12w ago

Leveraging large multimodal models for audio-video deepfake detection: a pilot study

Songjun Cao, Yuqi Li, Yunpeng Luo et al. · Tencent Youtu Lab · Fudan University

Fine-tunes Qwen 2.5 Omni as a unified audio-visual deepfake detector via two-stage LoRA and encoder fine-tuning

Output Integrity Attack multimodalaudiovision

PDF

defense arXiv Feb 10, 2026 · Feb 2026

Omni-Safety under Cross-Modality Conflict: Vulnerabilities, Dynamics Mechanisms and Efficient Alignment

Kun Wang, Zherui Li, Zhenhong Zhou et al. · Nanyang Technological University · Beijing University of Posts and Telecommunications +4 more

Exposes cross-modal jailbreak vulnerabilities in omni-modal LLMs and defends via SVD-guided refusal vector amplification with lightweight adapters

Prompt Injection multimodalnlp

PDF Code

defense arXiv Feb 4, 2026 · Feb 2026

SIDeR: Semantic Identity Decoupling for Unrestricted Face Privacy

Zhuosen Bao, Xia Du, Zheng Lin et al. · Xiamen University of Technology · University of Hong Kong +8 more

Generates unrestricted adversarial faces using diffusion models to evade facial recognition with 99% black-box success rate

Input Manipulation Attack visiongenerative

PDF

attack arXiv Feb 3, 2026 · Feb 2026

Semantic-level Backdoor Attack against Text-to-Image Diffusion Models

Tianxin Chen, Wenbo Jiang, Hongqiao Chen et al. · Fudan University · University of Electronic Science and Technology of China +1 more

Backdoor attack on T2I diffusion models using semantic-space triggers that evade enumeration and attention-consistency defenses with 100% ASR

Model Poisoning visionnlpgenerativemultimodal

PDF

benchmark arXiv Feb 3, 2026 · Feb 2026

CSR-Bench: A Benchmark for Evaluating the Cross-modal Safety and Reliability of MLLMs

Yuxuan Liu, Yuntian Shi, Kun Wang et al. · Zhejiang University · Fudan University +1 more

Benchmark exposing cross-modal safety gaps in 16 VLMs via image-text combinations that bypass or confuse safety alignment

Prompt Injection multimodalvisionnlp

PDF

defense arXiv Feb 3, 2026 · Feb 2026

SEW: Strengthening Robustness of Black-box DNN Watermarking via Specificity Enhancement

Huming Qiu, Mi Zhang, Junjie Sun et al. · Fudan University · Alibaba Group

Defends DNN model ownership watermarks against removal attacks by reducing watermark association with approximate reverse-engineered keys

Model Theft vision

PDF

attack arXiv Feb 1, 2026 · Feb 2026

Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

Kaiyuan Cui, Yige Li, Yutao Wu et al. · The University of Melbourne · Singapore Management University +2 more

Adversarial image attack jailbreaks VLMs with universal cross-target and cross-model transferability using a single surrogate model

Input Manipulation Attack Prompt Injection visionnlpmultimodal

PDF Code

attack arXiv Jan 30, 2026 · Jan 2026

From Similarity to Vulnerability: Key Collision Attack on LLM Semantic Caching

Zhixiang Zhang, Zesen Liu, Yuchong Xie et al. · The Hong Kong University of Science and Technology · Fudan University

CacheAttack exploits semantic cache collision vulnerabilities to hijack LLM responses at 86% success rate across major providers

Output Integrity Attack Prompt Injection nlp

PDF

defense arXiv Jan 30, 2026 · Jan 2026

Color Matters: Demosaicing-Guided Color Correlation Training for Generalizable AI-Generated Image Detection

Nan Zhong, Yiran Xu, Mian Zou · City University of Hong Kong · Fudan University +1 more

Detects AI-generated images via camera CFA color correlations, achieving state-of-the-art generalization across 20+ unseen generators

Output Integrity Attack vision

PDF

Loading more papers…

Latest papers

When Choices Become Risks: Safety Failures of Large Language Models under Multiple-Choice Constraints

FlowGuard: Towards Lightweight In-Generation Safety Detection for Diffusion Models via Linear Latent Decoding

SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers

Adversarial Attenuation Patch Attack for SAR Object Detection

Invisible Threats from Model Context Protocol: Generating Stealthy Injection Payload via Tree-based Adaptive Search

VIGIL: Part-Grounded Structured Reasoning for Generalizable Deepfake Detection

Backdoor4Good: Benchmarking Beneficial Uses of Backdoors in LLMs

When Safety Becomes a Vulnerability: Exploiting LLM Alignment Homogeneity for Transferable Blocking in RAG

RA-Det: Towards Universal Detection of AI-Generated Images via Robustness Asymmetry

Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models

Leveraging large multimodal models for audio-video deepfake detection: a pilot study

Omni-Safety under Cross-Modality Conflict: Vulnerabilities, Dynamics Mechanisms and Efficient Alignment

SIDeR: Semantic Identity Decoupling for Unrestricted Face Privacy

Semantic-level Backdoor Attack against Text-to-Image Diffusion Models

CSR-Bench: A Benchmark for Evaluating the Cross-modal Safety and Reliability of MLLMs

SEW: Strengthening Robustness of Black-box DNN Watermarking via Specificity Enhancement

Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

From Similarity to Vulnerability: Key Collision Attack on LLM Semantic Caching

Color Matters: Demosaicing-Guided Color Correlation Training for Generalizable AI-Generated Image Detection

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue