Latest papers

38 papers
defense arXiv Apr 6, 2026 · 2d ago

A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models

Tianmeng Fang, Yong Wang, Zetai Kong et al. · Singapore Management University · China University of Mining and Technology +4 more

Defends vision-language models against backdoors using patch augmentation and cross-view regularization to break trigger invariance

Model Poisoning multimodalvisionnlp
PDF
defense arXiv Mar 25, 2026 · 14d ago

Unleashing Vision-Language Semantics for Deepfake Video Detection

Jiawen Zhu, Yunqi Miao, Xueyi Zhang et al. · The University of Warwick · Nanyang Technological University +2 more

Deepfake detector leveraging CLIP's vision-language semantics with identity-aware prompting to achieve fine-grained forgery localization

Output Integrity Attack visionmultimodal
PDF Code
defense arXiv Mar 16, 2026 · 23d ago

Architecture-Agnostic Feature Synergy for Universal Defense Against Heterogeneous Generative Threats

Bingxue Zhang, Yang Gao, Feida Zhu et al. · University of Shanghai for Science and Technology · Singapore Management University +1 more

Universal adversarial defense against heterogeneous generative models using feature-space alignment to protect images from unauthorized editing

Input Manipulation Attack visiongenerative
PDF
attack arXiv Mar 16, 2026 · 23d ago

ClawWorm: Self-Propagating Attacks Across LLM Agent Ecosystems

Yihao Zhang, Zeming Wei, Xiaokun Luan et al. · Peking University · Sun Yat-Sen University +3 more

Self-replicating worm attack on LLM agent ecosystems achieving autonomous propagation through configuration hijacking and broadcast infection

AI Supply Chain Attacks Prompt Injection Excessive Agency nlpmultimodal
PDF
attack arXiv Mar 12, 2026 · 27d ago

Delayed Backdoor Attacks: Exploring the Temporal Dimension as a New Attack Surface in Pre-Trained Models

Zikang Ding, Haomiao Yang, Meng Hao et al. · University of Electronic Science and Technology of China · Singapore Management University +2 more

Proposes temporally-delayed backdoor attacks on NLP pre-trained models using common everyday words as stealthy triggers

Model Poisoning nlp
PDF
benchmark arXiv Mar 8, 2026 · 4w ago

Backdoor4Good: Benchmarking Beneficial Uses of Backdoors in LLMs

Yige Li, Wei Zhao, Zhe Li et al. · Singapore Management University · The University of Melbourne +1 more

Benchmarks beneficial uses of LLM backdoors for safety enforcement, access control, and watermarking via trigger conditioning

Model Poisoning Prompt Injection nlp
PDF Code
attack arXiv Feb 27, 2026 · 5w ago

Induced Numerical Instability: Hidden Costs in Multimodal Large Language Models

Wai Tuck Wong, Jun Sun, Arunesh Sinha · Singapore Management University · Rutgers University

Crafts adversarial images inducing numerical instability in VLMs, causing benchmark performance degradation with minimal pixel perturbation

Input Manipulation Attack Prompt Injection visionmultimodalnlp
PDF
defense arXiv Feb 25, 2026 · 6w ago

TranX-Adapter: Bridging Artifacts and Semantics within MLLMs for Robust AI-generated Image Detection

Wenbin Wang, Yuge Huang, Jianqing Xu et al. · Wuhan University · Tencent Youtu Lab +1 more

Fixes attention dilution in MLLM-based AI-generated image detectors via optimal transport and cross-attention fusion

Output Integrity Attack visionmultimodal
PDF Code
attack arXiv Feb 1, 2026 · 9w ago

Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

Kaiyuan Cui, Yige Li, Yutao Wu et al. · The University of Melbourne · Singapore Management University +2 more

Adversarial image attack jailbreaks VLMs with universal cross-target and cross-model transferability using a single surrogate model

Input Manipulation Attack Prompt Injection visionnlpmultimodal
PDF Code
defense arXiv Jan 29, 2026 · 9w ago

AtPatch: Debugging Transformers via Hot-Fixing Over-Attention

Shihao Weng, Yang Feng, Jincheng Li et al. · Nanjing University · Singapore Management University

Inference-time defense that neutralizes backdoor triggers in transformers by detecting and redistributing anomalous attention maps without modifying weights

Model Poisoning visionnlp
PDF
attack arXiv Jan 29, 2026 · 9w ago

Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs

Xiang Zheng, Yutao Wu, Hanxun Huang et al. · City University of Hong Kong · Deakin University +4 more

Self-evolving agent framework extracts hidden system prompts from 41 commercial LLMs using UCB-guided natural language probing strategies

Sensitive Information Disclosure Prompt Injection nlp
PDF
attack arXiv Jan 19, 2026 · 11w ago

CORVUS: Red-Teaming Hallucination Detectors via Internal Signal Camouflage in Large Language Models

Nay Myat Min, Long H. Pham, Hongyu Zhang et al. · Singapore Management University · Chongqing University

Attacks LLM hallucination detectors by fine-tuning LoRA adapters to camouflage internal uncertainty, hidden-state, and attention signals

Output Integrity Attack nlp
PDF
benchmark arXiv Jan 8, 2026 · Jan 2026

BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents

Yunhao Feng, Yige Li, Yutao Wu et al. · Fudan University · Alibaba Group +4 more

Benchmark framework systematizing backdoor attacks across planning, memory, and tool-use stages of LLM agent workflows

Model Poisoning Excessive Agency nlpmultimodal
1 citations PDF Code
defense arXiv Jan 6, 2026 · Jan 2026

Rendering Data Unlearnable by Exploiting LLM Alignment Mechanisms

Ruihan Zhang, Jun Sun · Singapore Management University

Defends proprietary text from unauthorized LLM training by injecting alignment-triggering disclaimers that sabotage fine-tuning via persistent safety-layer activation

Data Poisoning Attack Training Data Poisoning nlp
PDF
benchmark arXiv Dec 17, 2025 · Dec 2025

MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers

Xuanjun Zong, Zhiqi Shen, Lei Wang et al. · East China Normal University · Salesforce AI Research +2 more

Benchmark of 20 MCP attack types across 5 real-world domains revealing escalating LLM agent safety gaps in multi-step tool-use workflows

Insecure Plugin Design Excessive Agency nlp
4 citations PDF Code
defense USENIX Security Dec 17, 2025 · Dec 2025

From Risk to Resilience: Towards Assessing and Mitigating the Risk of Data Reconstruction Attacks in Federated Learning

Xiangrui Xu, Zhize Li, Yufei Han et al. · Beijing Jiaotong University · Singapore Management University +3 more

Theoretical framework quantifying data reconstruction attack risk in federated learning via Jacobian spectral analysis, with adaptive noise defenses

Model Inversion Attack federated-learningvision
1 citations PDF
benchmark arXiv Dec 4, 2025 · Dec 2025

SoK: a Comprehensive Causality Analysis Framework for Large Language Model Security

Wei Zhao, Zhe Li, Jun Sun · Singapore Management University

Surveys and benchmarks causality-based jailbreak attacks and defenses, showing safety mechanisms localize in 1–2% of LLM neurons

Model Poisoning Prompt Injection nlp
PDF Code
benchmark arXiv Nov 24, 2025 · Nov 2025

BackdoorVLM: A Benchmark for Backdoor Attacks on Vision-Language Models

Juncheng Li, Yige Li, Hanxun Huang et al. · Fudan University · Singapore Management University +1 more

Benchmarks backdoor attacks on VLMs, finding text triggers achieve 90%+ success at just 1% poisoning rate

Model Poisoning visionnlpmultimodal
PDF Code
attack arXiv Nov 20, 2025 · Nov 2025

AutoBackdoor: Automating Backdoor Attacks via LLM Agents

Yige Li, Zhe Li, Wei Zhao et al. · Singapore Management University · The University of Melbourne +1 more

Automates LLM backdoor injection via LLM agents generating semantic triggers, achieving 90%+ success rate while evading state-of-the-art defenses

Model Poisoning Training Data Poisoning nlp
2 citations PDF Code
defense arXiv Nov 20, 2025 · Nov 2025

Q-MLLM: Vector Quantization for Robust Multimodal Large Language Model Security

Wei Zhao, Zhe Li, Yige Li et al. · Singapore Management University

Defends VLMs against adversarial visual jailbreaks using two-level vector quantization as a discrete bottleneck

Input Manipulation Attack Prompt Injection visionnlpmultimodal
1 citations PDF Code
Loading more papers…