ML Security Papers

Latest papers

38 papers

defense arXiv Apr 6, 2026 · 2d ago

A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models

Tianmeng Fang, Yong Wang, Zetai Kong et al. · Singapore Management University · China University of Mining and Technology +4 more

Defends vision-language models against backdoors using patch augmentation and cross-view regularization to break trigger invariance

Model Poisoning multimodalvisionnlp

PDF

Multimodal large language models have become an important infrastructure for unified processing of visual and linguistic tasks. However, such models are highly susceptible to backdoor implantation during supervised fine-tuning and will steadily output the attacker's predefined harmful responses once a specific trigger pattern is activated. The core challenge of backdoor defense lies in suppressing attack success under low poisoning ratios while preserving the model's normal generation ability. These two objectives are inherently conflicting. Strong suppression often degrades benign performance, whereas weak regularization fails to mitigate backdoor behaviors. To this end, we propose a unified defense framework based on patch augmentation and cross-view regularity, which simultaneously constrains the model's anomalous behaviors in response to triggered patterns from both the feature representation and output distribution levels. Specifically, patch-level data augmentation is combined with cross-view output difference regularization to exploit the fact that backdoor responses are abnormally invariant to non-semantic perturbations and to proactively pull apart the output distributions of the original and perturbed views, thereby significantly suppressing the success rate of backdoor triggering. At the same time, we avoid over-suppression of the model during defense by imposing output entropy constraints, ensuring the quality of normal command generation. Experimental results across three models, two tasks, and six attacks show that our proposed defense method effectively reduces the attack success rate while maintaining a high level of normal text generation capability. Our work enables the secure, controlled deployment of large-scale multimodal models in realistic low-frequency poisoning and covert triggering scenarios.

vlm multimodal llm transformer Singapore Management University · China University of Mining and Technology · The University of Melbourne +3 more

PDF arXiv

defense arXiv Mar 25, 2026 · 14d ago

Unleashing Vision-Language Semantics for Deepfake Video Detection

Jiawen Zhu, Yunqi Miao, Xueyi Zhang et al. · The University of Warwick · Nanyang Technological University +2 more

Deepfake detector leveraging CLIP's vision-language semantics with identity-aware prompting to achieve fine-grained forgery localization

Output Integrity Attack visionmultimodal

PDF Code

defense arXiv Mar 16, 2026 · 23d ago

Architecture-Agnostic Feature Synergy for Universal Defense Against Heterogeneous Generative Threats

Bingxue Zhang, Yang Gao, Feida Zhu et al. · University of Shanghai for Science and Technology · Singapore Management University +1 more

Universal adversarial defense against heterogeneous generative models using feature-space alignment to protect images from unauthorized editing

Input Manipulation Attack visiongenerative

PDF

attack arXiv Mar 16, 2026 · 23d ago

ClawWorm: Self-Propagating Attacks Across LLM Agent Ecosystems

Yihao Zhang, Zeming Wei, Xiaokun Luan et al. · Peking University · Sun Yat-Sen University +3 more

Self-replicating worm attack on LLM agent ecosystems achieving autonomous propagation through configuration hijacking and broadcast infection

AI Supply Chain Attacks Prompt Injection Excessive Agency nlpmultimodal

PDF

attack arXiv Mar 12, 2026 · 27d ago

Delayed Backdoor Attacks: Exploring the Temporal Dimension as a New Attack Surface in Pre-Trained Models

Zikang Ding, Haomiao Yang, Meng Hao et al. · University of Electronic Science and Technology of China · Singapore Management University +2 more

Proposes temporally-delayed backdoor attacks on NLP pre-trained models using common everyday words as stealthy triggers

Model Poisoning nlp

PDF

benchmark arXiv Mar 8, 2026 · 4w ago

Backdoor4Good: Benchmarking Beneficial Uses of Backdoors in LLMs

Yige Li, Wei Zhao, Zhe Li et al. · Singapore Management University · The University of Melbourne +1 more

Benchmarks beneficial uses of LLM backdoors for safety enforcement, access control, and watermarking via trigger conditioning

Model Poisoning Prompt Injection nlp

PDF Code

attack arXiv Feb 27, 2026 · 5w ago

Induced Numerical Instability: Hidden Costs in Multimodal Large Language Models

Wai Tuck Wong, Jun Sun, Arunesh Sinha · Singapore Management University · Rutgers University

Crafts adversarial images inducing numerical instability in VLMs, causing benchmark performance degradation with minimal pixel perturbation

Input Manipulation Attack Prompt Injection visionmultimodalnlp

PDF

defense arXiv Feb 25, 2026 · 6w ago

TranX-Adapter: Bridging Artifacts and Semantics within MLLMs for Robust AI-generated Image Detection

Wenbin Wang, Yuge Huang, Jianqing Xu et al. · Wuhan University · Tencent Youtu Lab +1 more

Fixes attention dilution in MLLM-based AI-generated image detectors via optimal transport and cross-attention fusion

Output Integrity Attack visionmultimodal

PDF Code

attack arXiv Feb 1, 2026 · 9w ago

Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

Kaiyuan Cui, Yige Li, Yutao Wu et al. · The University of Melbourne · Singapore Management University +2 more

Adversarial image attack jailbreaks VLMs with universal cross-target and cross-model transferability using a single surrogate model

Input Manipulation Attack Prompt Injection visionnlpmultimodal

PDF Code

defense arXiv Jan 29, 2026 · 9w ago

AtPatch: Debugging Transformers via Hot-Fixing Over-Attention

Shihao Weng, Yang Feng, Jincheng Li et al. · Nanjing University · Singapore Management University

Inference-time defense that neutralizes backdoor triggers in transformers by detecting and redistributing anomalous attention maps without modifying weights

Model Poisoning visionnlp

PDF

attack arXiv Jan 29, 2026 · 9w ago

Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs

Xiang Zheng, Yutao Wu, Hanxun Huang et al. · City University of Hong Kong · Deakin University +4 more

Self-evolving agent framework extracts hidden system prompts from 41 commercial LLMs using UCB-guided natural language probing strategies

Sensitive Information Disclosure Prompt Injection nlp

PDF

attack arXiv Jan 19, 2026 · 11w ago

CORVUS: Red-Teaming Hallucination Detectors via Internal Signal Camouflage in Large Language Models

Nay Myat Min, Long H. Pham, Hongyu Zhang et al. · Singapore Management University · Chongqing University

Attacks LLM hallucination detectors by fine-tuning LoRA adapters to camouflage internal uncertainty, hidden-state, and attention signals

Output Integrity Attack nlp

PDF

benchmark arXiv Jan 8, 2026 · Jan 2026

BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents

Yunhao Feng, Yige Li, Yutao Wu et al. · Fudan University · Alibaba Group +4 more

Benchmark framework systematizing backdoor attacks across planning, memory, and tool-use stages of LLM agent workflows

Model Poisoning Excessive Agency nlpmultimodal

1 citations PDF Code

defense arXiv Jan 6, 2026 · Jan 2026

Rendering Data Unlearnable by Exploiting LLM Alignment Mechanisms

Ruihan Zhang, Jun Sun · Singapore Management University

Defends proprietary text from unauthorized LLM training by injecting alignment-triggering disclaimers that sabotage fine-tuning via persistent safety-layer activation

Data Poisoning Attack Training Data Poisoning nlp

PDF

benchmark arXiv Dec 17, 2025 · Dec 2025

MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers

Xuanjun Zong, Zhiqi Shen, Lei Wang et al. · East China Normal University · Salesforce AI Research +2 more

Benchmark of 20 MCP attack types across 5 real-world domains revealing escalating LLM agent safety gaps in multi-step tool-use workflows

Insecure Plugin Design Excessive Agency nlp

4 citations PDF Code

defense USENIX Security Dec 17, 2025 · Dec 2025

From Risk to Resilience: Towards Assessing and Mitigating the Risk of Data Reconstruction Attacks in Federated Learning

Xiangrui Xu, Zhize Li, Yufei Han et al. · Beijing Jiaotong University · Singapore Management University +3 more

Theoretical framework quantifying data reconstruction attack risk in federated learning via Jacobian spectral analysis, with adaptive noise defenses

Model Inversion Attack federated-learningvision

1 citations PDF

benchmark arXiv Dec 4, 2025 · Dec 2025

SoK: a Comprehensive Causality Analysis Framework for Large Language Model Security

Wei Zhao, Zhe Li, Jun Sun · Singapore Management University

Surveys and benchmarks causality-based jailbreak attacks and defenses, showing safety mechanisms localize in 1–2% of LLM neurons

Model Poisoning Prompt Injection nlp

PDF Code

benchmark arXiv Nov 24, 2025 · Nov 2025

BackdoorVLM: A Benchmark for Backdoor Attacks on Vision-Language Models

Juncheng Li, Yige Li, Hanxun Huang et al. · Fudan University · Singapore Management University +1 more

Benchmarks backdoor attacks on VLMs, finding text triggers achieve 90%+ success at just 1% poisoning rate

Model Poisoning visionnlpmultimodal

PDF Code

attack arXiv Nov 20, 2025 · Nov 2025

AutoBackdoor: Automating Backdoor Attacks via LLM Agents

Yige Li, Zhe Li, Wei Zhao et al. · Singapore Management University · The University of Melbourne +1 more

Automates LLM backdoor injection via LLM agents generating semantic triggers, achieving 90%+ success rate while evading state-of-the-art defenses

Model Poisoning Training Data Poisoning nlp

2 citations PDF Code

defense arXiv Nov 20, 2025 · Nov 2025

Q-MLLM: Vector Quantization for Robust Multimodal Large Language Model Security

Wei Zhao, Zhe Li, Yige Li et al. · Singapore Management University

Defends VLMs against adversarial visual jailbreaks using two-level vector quantization as a discrete bottleneck

Input Manipulation Attack Prompt Injection visionnlpmultimodal

1 citations PDF Code

Loading more papers…

Latest papers

A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models

Unleashing Vision-Language Semantics for Deepfake Video Detection

Architecture-Agnostic Feature Synergy for Universal Defense Against Heterogeneous Generative Threats

ClawWorm: Self-Propagating Attacks Across LLM Agent Ecosystems

Delayed Backdoor Attacks: Exploring the Temporal Dimension as a New Attack Surface in Pre-Trained Models

Backdoor4Good: Benchmarking Beneficial Uses of Backdoors in LLMs

Induced Numerical Instability: Hidden Costs in Multimodal Large Language Models

TranX-Adapter: Bridging Artifacts and Semantics within MLLMs for Robust AI-generated Image Detection

Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

AtPatch: Debugging Transformers via Hot-Fixing Over-Attention

Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs

CORVUS: Red-Teaming Hallucination Detectors via Internal Signal Camouflage in Large Language Models

BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents

Rendering Data Unlearnable by Exploiting LLM Alignment Mechanisms

MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers

From Risk to Resilience: Towards Assessing and Mitigating the Risk of Data Reconstruction Attacks in Federated Learning

SoK: a Comprehensive Causality Analysis Framework for Large Language Model Security

BackdoorVLM: A Benchmark for Backdoor Attacks on Vision-Language Models

AutoBackdoor: Automating Backdoor Attacks via LLM Agents

Q-MLLM: Vector Quantization for Robust Multimodal Large Language Model Security

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue