ML Security Papers

Latest papers

36 papers

attack arXiv Apr 20, 2026 · 4w ago

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

Yuan Fang, Yiming Luo, Aimin Zhou et al. · East China Normal University · Shanghai Innovation Institute

Automated red-teaming framework generating diverse toxic datasets via inverted constitutional AI to test LLM safety mechanisms

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF Code

attack arXiv Apr 2, 2026 · 7w ago

Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models

Jiawei Chen, Simin Huang, Jiawei Du et al. · East China Normal University · Zhongguancun Academy +3 more

Physically realizable 3D adversarial textures that degrade vision-language-action robot models with 96.7% task failure rates

Input Manipulation Attack visionmultimodalnlp

PDF Code

Vision-language-action (VLA) models have shown strong performance in robotic manipulation, yet their robustness to physically realizable adversarial attacks remains underexplored. Existing studies reveal vulnerabilities through language perturbations and 2D visual attacks, but these attack surfaces are either less representative of real deployment or limited in physical realism. In contrast, adversarial 3D textures pose a more physically plausible and damaging threat, as they are naturally attached to manipulated objects and are easier to deploy in physical environments. Bringing adversarial 3D textures to VLA systems is nevertheless nontrivial. A central obstacle is that standard 3D simulators do not provide a differentiable optimization path from the VLA objective function back to object appearance, making it difficult to optimize through an end-to-end manner. To address this, we introduce Foreground-Background Decoupling (FBD), which enables differentiable texture optimization through dual-renderer alignment while preserving the original simulation environment. To further ensure that the attack remains effective across long-horizon and diverse viewpoints in the physical world, we propose Trajectory-Aware Adversarial Optimization (TAAO), which prioritizes behaviorally critical frames and stabilizes optimization with a vertex-based parameterization. Built on these designs, we present Tex3D, the first framework for end-to-end optimization of 3D adversarial textures directly within the VLA simulation environment. Experiments in both simulation and real-robot settings show that Tex3D significantly degrades VLA performance across multiple manipulation tasks, achieving task failure rates of up to 96.7\%. Our empirical results expose critical vulnerabilities of VLA systems to physically grounded 3D adversarial attacks and highlight the need for robustness-aware training.

vlm multimodal transformer East China Normal University · Zhongguancun Academy · A*STAR +2 more

PDF arXiv Code

benchmark arXiv Mar 30, 2026 · 7w ago

Evaluating Privilege Usage of Agents on Real-World Tools

Quan Zhang, Lianhang Fu, Lvsi Lian et al. · East China Normal University · Xinjiang University +1 more

Benchmark evaluating LLM agents' privilege control under prompt injection attacks using real-world tools, finding 84.80% attack success

Prompt Injection Insecure Plugin Design Excessive Agency nlp

PDF

defense arXiv Mar 25, 2026 · 8w ago

DP^2-VL: Private Photo Dataset Protection by Data Poisoning for Vision-Language Models

Hongyi Miao, Jun Jia, Xincheng Wang et al. · Shandong University · Shanghai Jiao Tong University +4 more

Data poisoning defense that protects private photo datasets from VLM fine-tuning attacks that extract identity-affiliation relationships

Data Poisoning Attack Sensitive Information Disclosure visionnlpmultimodal

PDF

Recent advances in visual-language alignment have endowed vision-language models (VLMs) with fine-grained image understanding capabilities. However, this progress also introduces new privacy risks. This paper first proposes a novel privacy threat model named identity-affiliation learning: an attacker fine-tunes a VLM using only a few private photos of a target individual, thereby embedding associations between the target facial identity and their private property and social relationships into the model's internal representations. Once deployed via public APIs, this model enables unauthorized exposure of the target user's private information upon input of their photos. To benchmark VLMs' susceptibility to such identity-affiliation leakage, we introduce the first identity-affiliation dataset comprising seven typical scenarios appearing in private photos. Each scenario is instantiated with multiple identity-centered photo-description pairs. Experimental results demonstrate that mainstream VLMs like LLaVA, Qwen-VL, and MiniGPT-v2, can recognize facial identities and infer identity-affiliation relationships by fine-tuning on small-scale private photographic dataset, and even on synthetically generated datasets. To mitigate this privacy risk, we propose DP2-VL, the first Dataset Protection framework for private photos that leverages Data Poisoning. Though optimizing imperceptible perturbations by pushing the original representations toward an antithetical region, DP2-VL induces a dataset-level shift in the embedding space of VLMs'encoders. This shift separates protected images from clean inference images, causing fine-tuning on the protected set to overfit. Extensive experiments demonstrate that DP2-VL achieves strong generalization across models, robustness to diverse post-processing operations, and consistent effectiveness across varying protection ratios.

vlm transformer multimodal Shandong University · Shanghai Jiao Tong University · Donghua University +3 more

PDF arXiv

defense arXiv Mar 25, 2026 · 8w ago

AMIF: Authorizable Medical Image Fusion Model with Built-in Authentication

Jie Song, Jun Jia, Wei Sun et al. · Macao Polytechnic University · Shanghai Jiao Tong University +2 more

Medical image fusion model embedding visible copyright watermarks in outputs, removable only with authentication keys

Model Theft Output Integrity Attack visionmultimodal

PDF

defense arXiv Feb 10, 2026 · Feb 2026

AGMark: Attention-Guided Dynamic Watermarking for Large Vision-Language Models

Yue Li, Xin Yi, Dongsheng Shi et al. · East China Normal University · Hasso Plattner Institute +1 more

Attention-guided dynamic watermarking for LVLM outputs that preserves visual fidelity while achieving 99.36% AUC detection accuracy

Output Integrity Attack nlpmultimodalvision

PDF

attack arXiv Feb 7, 2026 · Feb 2026

Reverse-Engineering Model Editing on Language Models

Zhiyu Sun, Minrui Luo, Yu Wang et al. · Shanghai Qi Zhi Institute · East China Normal University +3 more

Recovers private edited data from LLM parameter update matrices using spectral analysis and entropy-based prompt reconstruction

Model Inversion Attack Sensitive Information Disclosure nlp

PDF Code

defense arXiv Jan 30, 2026 · Jan 2026

VocBulwark: Towards Practical Generative Speech Watermarking via Additional-Parameter Injection

Weizhi Liu, Yue Li, Zhaoxia Yin · East China Normal University · Huaqiao University

Injects adapter parameters into speech vocoders to embed robust, high-fidelity watermarks in AI-generated audio for provenance tracking

Output Integrity Attack audiogenerative

PDF

defense arXiv Jan 19, 2026 · Jan 2026

LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion

Guanghao Zhou, Panjia Qiu, Cen Chen et al. · East China Normal University · Ant Group

Post-hoc LLM safety re-alignment via low-rank safety subspace fusion to restore guardrails degraded by fine-tuning

Transfer Learning Attack Prompt Injection nlp

3 citations 1 influentialPDF

defense arXiv Dec 20, 2025 · Dec 2025

Who Can See Through You? Adversarial Shielding Against VLM-Based Attribute Inference Attacks

Yucheng Fan, Jiawei Chen, Yu Tian et al. · East China Normal University · Zhongguancun Academy +1 more

Adversarial image perturbations shield social-media photos from VLM-based private attribute inference while preserving visual quality

Input Manipulation Attack visionmultimodal

PDF

benchmark arXiv Dec 17, 2025 · Dec 2025

MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers

Xuanjun Zong, Zhiqi Shen, Lei Wang et al. · East China Normal University · Salesforce AI Research +2 more

Benchmark of 20 MCP attack types across 5 real-world domains revealing escalating LLM agent safety gaps in multi-step tool-use workflows

Insecure Plugin Design Excessive Agency nlp

4 citations PDF Code

defense TIFS Dec 17, 2025 · Dec 2025

ArcGen: Generalizing Neural Backdoor Detection Across Diverse Architectures

Zhonghao Yang, Cheng Luo, Daojing He et al. · East China Normal University · Harbin Institute of Technology +2 more

Defends against backdoor attacks by learning architecture-invariant model features for robust detection across unseen model architectures

Model Poisoning vision

PDF Code

defense arXiv Nov 29, 2025 · Nov 2025

SAIDO: Generalizable Detection of AI-Generated Images via Scene-Aware and Importance-Guided Dynamic Optimization in Continual Learning

Yongkang Hu, Yu Cheng, Yushuo Zhang et al. · East China Normal University · Shanghai Innovation Institute

Continual-learning detection framework for AI-generated images using scene-aware expert modules and gradient-projection to prevent forgetting

Output Integrity Attack vision

PDF

defense arXiv Nov 25, 2025 · Nov 2025

Adapter Shield: A Unified Framework with Built-in Authentication for Preventing Unauthorized Zero-Shot Image-to-Image Generation

Jun Jia, Hongyi Miao, Yingjie Zhou et al. · Shandong University · Shanghai Jiao Tong University +2 more

Adversarial perturbation defense that disrupts zero-shot diffusion generation of faces and styles while permitting authenticated access via reversible embedding encryption

Input Manipulation Attack Output Integrity Attack visiongenerative

PDF

defense arXiv Nov 25, 2025 · Nov 2025

DLADiff: A Dual-Layer Defense Framework against Fine-Tuning and Zero-Shot Customization of Diffusion Models

Jun Jia, Hongyi Miao, Yingjie Zhou et al. · Shanghai Jiao Tong University · Shandong University +2 more

Defends facial images from diffusion model customization by adding dual-layer adversarial perturbations that disrupt both fine-tuning and zero-shot identity generation

Output Integrity Attack visiongenerative

PDF

benchmark arXiv Nov 24, 2025 · Nov 2025

Evaluating Dataset Watermarking for Fine-tuning Traceability of Customized Diffusion Models: A Comprehensive Benchmark and Removal Approach

Xincheng Wang, Hanchi Sun, Wenjun Sun et al. · Donghua University · Shanghai JiaoTong University +3 more

Benchmarks dataset watermarking schemes for diffusion model traceability and proposes a removal attack that fully defeats them

Output Integrity Attack visiongenerative

PDF

attack arXiv Nov 20, 2025 · Nov 2025

"To Survive, I Must Defect": Jailbreaking LLMs via the Game-Theory Scenarios

Zhen Sun, Zongmin Zhang, Deqi Liang et al. · The Hong Kong University of Science and Technology · East China Normal University +5 more

Game-theoretic black-box jailbreak using Prisoner's Dilemma scenarios to flip LLM safety preferences, achieving 95%+ ASR on GPT-4o and DeepSeek-R1

Prompt Injection nlp

2 citations PDF Code

As LLMs become more common, non-expert users can pose risks, prompting extensive research into jailbreak attacks. However, most existing black-box jailbreak attacks rely on hand-crafted heuristics or narrow search spaces, which limit scalability. Compared with prior attacks, we propose Game-Theory Attack (GTA), an scalable black-box jailbreak framework. Concretely, we formalize the attacker's interaction against safety-aligned LLMs as a finite-horizon, early-stoppable sequential stochastic game, and reparameterize the LLM's randomized outputs via quantal response. Building on this, we introduce a behavioral conjecture "template-over-safety flip": by reshaping the LLM's effective objective through game-theoretic scenarios, the originally safety preference may become maximizing scenario payoffs within the template, which weakens safety constraints in specific contexts. We validate this mechanism with classical game such as the disclosure variant of the Prisoner's Dilemma, and we further introduce an Attacker Agent that adaptively escalates pressure to increase the ASR. Experiments across multiple protocols and datasets show that GTA achieves over 95% ASR on LLMs such as Deepseek-R1, while maintaining efficiency. Ablations over components, decoding, multilingual settings, and the Agent's core model confirm effectiveness and generalization. Moreover, scenario scaling studies further establish scalability. GTA also attains high ASR on other game-theoretic scenarios, and one-shot LLM-generated variants that keep the model mechanism fixed while varying background achieve comparable ASR. Paired with a Harmful-Words Detection Agent that performs word-level insertions, GTA maintains high ASR while lowering detection under prompt-guard models. Beyond benchmarks, GTA jailbreaks real-world LLM applications and reports a longitudinal safety monitoring of popular HuggingFace LLMs.

llm The Hong Kong University of Science and Technology · East China Normal University · Flexera +4 more

PDF arXiv DOI Code

defense arXiv Nov 18, 2025 · Nov 2025

Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education

Xin Yi, Yue Li, Dongsheng Shi et al. · East China Normal University

Three-stage defense framework for educational LLMs that resists both jailbreak and fine-tuning safety-removal attacks

Transfer Learning Attack Prompt Injection nlp

1 citations PDF

defense arXiv Nov 18, 2025 · Nov 2025

Sigil: Server-Enforced Watermarking in U-Shaped Split Federated Learning via Gradient Injection

Zhengchunmin Dai, Jiaxiong Tang, Peng Sun et al. · East China Normal University · Hunan University +1 more

Embeds ownership watermarks into client models via server-side gradient injection in split federated learning to defend against model theft

Model Theft visionfederated-learning

PDF

defense arXiv Nov 17, 2025 · Nov 2025

Robust Client-Server Watermarking for Split Federated Learning

Jiaxiong Tang, Zhengchunmin Dai, Liantao Wu et al. · East China Normal University · Hunan University +1 more

Embeds asymmetric client-server watermarks into split federated learning models to prove joint ownership and resist removal attacks

Model Theft federated-learning

PDF

Loading more papers…

Latest papers

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models

Evaluating Privilege Usage of Agents on Real-World Tools

DP^2-VL: Private Photo Dataset Protection by Data Poisoning for Vision-Language Models

AMIF: Authorizable Medical Image Fusion Model with Built-in Authentication

AGMark: Attention-Guided Dynamic Watermarking for Large Vision-Language Models

Reverse-Engineering Model Editing on Language Models

VocBulwark: Towards Practical Generative Speech Watermarking via Additional-Parameter Injection

LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion

Who Can See Through You? Adversarial Shielding Against VLM-Based Attribute Inference Attacks

MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers

ArcGen: Generalizing Neural Backdoor Detection Across Diverse Architectures

SAIDO: Generalizable Detection of AI-Generated Images via Scene-Aware and Importance-Guided Dynamic Optimization in Continual Learning

Adapter Shield: A Unified Framework with Built-in Authentication for Preventing Unauthorized Zero-Shot Image-to-Image Generation

DLADiff: A Dual-Layer Defense Framework against Fine-Tuning and Zero-Shot Customization of Diffusion Models

Evaluating Dataset Watermarking for Fine-tuning Traceability of Customized Diffusion Models: A Comprehensive Benchmark and Removal Approach

"To Survive, I Must Defect": Jailbreaking LLMs via the Game-Theory Scenarios

Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education

Sigil: Server-Enforced Watermarking in U-Shaped Split Federated Learning via Gradient Injection

Robust Client-Server Watermarking for Split Federated Learning

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue