ML Security Papers

Latest papers

34 papers

benchmark arXiv Apr 29, 2026 · 22d ago

Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

Wenhao Lan, Shan Li, Junbin Yang et al. · University of Chinese Academy of Sciences · Inner Mongolia University of Technology +1 more

Mechanistic analysis showing adversarial fine-tuning reorganizes LLM refusal representations across layers while navigating robustness-utility tradeoffs

Prompt Injection nlp

PDF

defense arXiv Apr 11, 2026 · 5w ago

PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification

Guangyu Gong, Zizhuang Deng · Shandong University

Training-free defense that isolates agent planning from retrieved content to block indirect prompt injection, reducing attack success to zero

Prompt Injection nlp

PDF Code

defense arXiv Mar 25, 2026 · 8w ago

DP^2-VL: Private Photo Dataset Protection by Data Poisoning for Vision-Language Models

Hongyi Miao, Jun Jia, Xincheng Wang et al. · Shandong University · Shanghai Jiao Tong University +4 more

Data poisoning defense that protects private photo datasets from VLM fine-tuning attacks that extract identity-affiliation relationships

Data Poisoning Attack Sensitive Information Disclosure vision nlp multimodal

PDF

Recent advances in visual-language alignment have endowed vision-language models (VLMs) with fine-grained image understanding capabilities. However, this progress also introduces new privacy risks. This paper first proposes a novel privacy threat model named identity-affiliation learning: an attacker fine-tunes a VLM using only a few private photos of a target individual, thereby embedding associations between the target's facial identity and their private property and social relationships into the model's internal representations. Once deployed via public APIs, this model enables unauthorized exposure of the target user's private information upon input of their photos. To benchmark VLMs' susceptibility to such identity-affiliation leakage, we introduce the first identity-affiliation dataset comprising seven typical scenarios appearing in private photos. Each scenario is instantiated with multiple identity-centered photo-description pairs. Experimental results demonstrate that mainstream VLMs like LLaVA, Qwen-VL, and MiniGPT-v2 can recognize facial identities and infer identity-affiliation relationships by fine-tuning on small-scale private photographic datasets, and even on synthetically generated datasets. To mitigate this privacy risk, we propose DP2-VL, the first Dataset Protection framework for private photos that leverages Data Poisoning. Through optimizing imperceptible perturbations by pushing the original representations toward an antithetical region, DP2-VL induces a dataset-level shift in the embedding space of VLMs' encoders. This shift separates protected images from clean inference images, causing fine-tuning on the protected set to overfit. Extensive experiments demonstrate that DP2-VL achieves strong generalization across models, robustness to diverse post-processing operations, and consistent effectiveness across varying protection ratios.

vlm transformer multimodal Shandong University · Shanghai Jiao Tong University · Donghua University +3 more

PDF arXiv

attack arXiv Mar 24, 2026 · 8w ago

Not All Tokens Are Created Equal: Query-Efficient Jailbreak Fuzzing for LLMs

Wenyu Chen, Xiangtao Meng, Chuanchao Zang et al. · Shandong University

Token-aware jailbreak fuzzing that achieves 90% attack success with 70% fewer queries by prioritizing high-contribution tokens

Prompt Injection nlp

PDF

attack arXiv Mar 23, 2026 · 8w ago

Thermal Topology Collapse: Universal Physical Patch Attacks on Infrared Vision Systems

Chengyin Hu, Yikun Guo, Yuxian Dong et al. · China University of Petroleum-Beijing · University of Electronic Science and Technology of China +3 more

Universal adversarial patch attack on infrared pedestrian detectors using parameterized Bézier curves and cold patches

Input Manipulation Attack vision

PDF

attack arXiv Mar 20, 2026 · 8w ago

Graph-Aware Text-Only Backdoor Poisoning for Text-Attributed Graphs

Qi Luo, Minghui Xu, Dongxiao Yu et al. · Shandong University

Text-only backdoor attack on graph neural networks that poisons node text while preserving graph structure, achieving near-perfect attack success rates

Model Poisoning Data Poisoning Attack nlp graph

PDF

attack arXiv Mar 18, 2026 · 9w ago

TINA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models

Qianlong Xiang, Miao Zhang, Haoyu Zhang et al. · Harbin Institute of Technology · City University of Hong Kong +3 more

Text-free inversion attack that recovers supposedly erased concepts from diffusion models by exploiting persistent visual knowledge

Model Inversion Attack vision generative

PDF

defense arXiv Mar 14, 2026 · 9w ago

Towards Generalizable Deepfake Detection via Real Distribution Bias Correction

Ming-Hui Liu, Harry Cheng, Xin Luo et al. · Shandong University · National University of Singapore

Deepfake detector exploiting real image distribution invariance to generalize across unseen forgery types and domains

Output Integrity Attack vision

PDF

defense arXiv Mar 11, 2026 · 10w ago

Don't Let the Claw Grip Your Hand: A Security Analysis and Defense Framework for OpenClaw

Zhengyang Shan, Jiayun Xin, Yue Zhang et al. · Shandong University

Analyzes LLM code agent vulnerabilities via 47 attack scenarios, then defends with Human-in-the-Loop tool-call interception raising defense rates from 17% to 92%

Prompt Injection Excessive Agency nlp

PDF Code

benchmark arXiv Mar 8, 2026 · 10w ago

Give Them an Inch and They Will Take a Mile: Understanding and Measuring Caller Identity Confusion in MCP-Based AI Systems

Yuhang Huang, Boyang Ma, Biwei Yan et al. · Shandong University · City University of Hong Kong

Large-scale empirical analysis reveals MCP servers fail to authenticate callers, enabling unauthorized tool access in LLM agent systems

Insecure Plugin Design nlp

PDF

attack arXiv Feb 11, 2026 · Feb 2026

When Skills Lie: Hidden-Comment Injection in LLM Agents

Qianli Wang, Boyang Ma, Minghui Xu et al. · Shandong University

Demonstrates hidden-comment prompt injection in LLM agent Skill documents, invisible to humans but followed by models, triggering malicious tool calls

Prompt Injection Insecure Plugin Design nlp

PDF

benchmark arXiv Feb 3, 2026 · Feb 2026

Don't believe everything you read: Understanding and Measuring MCP Behavior under Misleading Tool Descriptions

Zhihao Li, Boyang Ma, Xuelong Dai et al. · Shandong University

Measures description-code inconsistency across 10,240 MCP servers, finding 13% enable undocumented privileged or unauthorized actions by LLM agents

Insecure Plugin Design nlp

PDF

attack arXiv Jan 29, 2026 · Jan 2026

ICL-EVADER: Zero-Query Black-Box Evasion Attacks on In-Context Learning and Their Defenses

Ningyuan He, Ronghong Huang, Qianqian Tang et al. · University of Science and Technology of China · Shandong University +1 more

Zero-query black-box text attacks evade LLM-based in-context learning classifiers with 95.3% success, plus joint defense recipe

Prompt Injection nlp

PDF Code

defense arXiv Jan 29, 2026 · Jan 2026

TCAP: Tri-Component Attention Profiling for Unsupervised Backdoor Detection in MLLM Fine-Tuning

Mingzu Liu, Hao Fang, Runmin Cong · Shandong University · Key Laboratory of Machine Intelligence and System Control

Unsupervised defense that detects fine-tuning backdoors in MLLMs through divergence in attention allocation across instruction, vision, and query components

Model Poisoning vision nlp multimodal

PDF Code

attack arXiv Jan 27, 2026 · Jan 2026

GraphDLG: Exploring Deep Leakage from Gradients in Federated Graph Learning

Shuyue Wei, Wantong Chen, Tongyu Wei et al. · Shandong University · Beihang University +1 more

Gradient inversion attack on federated graph learning recovers private graph structure and node features from shared gradients via a closed-form recursive rule

Model Inversion Attack graph federated-learning

PDF

attack arXiv Jan 16, 2026 · Jan 2026

VidLeaks: Membership Inference Attacks Against Text-to-Video Models

Li Wang, Wenyu Chen, Ning Yu et al. · Shandong University · State Key Laboratory of Cryptography and Digital Economy Security +2 more

First MIA framework against text-to-video models exploiting sparse keyframe memorization and temporal consistency signals to infer training membership

Membership Inference Attack vision generative

PDF Code

attack arXiv Jan 2, 2026 · Jan 2026

Low Rank Comes with Low Security: Gradient Assembly Poisoning Attacks against Distributed LoRA-based LLM Systems

Yueyan Dong, Minghui Xu, Qin Hu et al. · Shandong University · Guangdong University of Finance and Economics +2 more

Exploits LoRA's decoupled A/B matrix aggregation in federated LLM fine-tuning to inject stealthy malicious updates that degrade model quality while evading anomaly detectors

Data Poisoning Attack Transfer Learning Attack nlp federated-learning

PDF

attack IACR ePrint Dec 19, 2025 · Dec 2025

Cryptanalysis of Pseudorandom Error-Correcting Codes

Tianrui Wang, Anyu Wang, Tianshuo Cong et al. · Tsinghua University · Shandong University

Cryptanalytic attacks break PRC-based AI content watermarks in 2^22 operations, validated against DeepSeek and Stable Diffusion

Output Integrity Attack nlp generative vision

PDF

tool arXiv Dec 13, 2025 · Dec 2025

UniMark: Artificial Intelligence Generated Content Identification Toolkit

Meilin Li, Ji He, Yi Yu et al. · Shanghai AI Laboratory · Shandong University +1 more

Unified open-source toolkit for multimodal AIGC governance via hidden watermarking and visible compliance marking

Output Integrity Attack multimodal nlp vision audio

PDF Code

benchmark arXiv Dec 6, 2025 · Dec 2025

Beyond Model Jailbreak: Systematic Dissection of the "Ten Deadly Sins" in Embodied Intelligence

Yuhang Huang, Junchao Li, Boyang Ma et al. · Shandong University · City University of Hong Kong

First holistic security audit of an LLM-powered robot platform reveals ten cross-layer vulnerabilities including multilingual LLM safety bypass and full physical hijack

Prompt Injection Excessive Agency multimodal nlp

PDF

