defense 2025

CoDefend: Cross-Modal Collaborative Defense via Diffusion Purification and Prompt Optimization

Fengling Zhu , Bo Liu , Jingyu Hua , Sheng Zhong

0 citations · 31 references · arXiv

α

Published on arXiv

2510.11096

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Supervised diffusion purification combined with prompt optimization substantially improves robustness of MLLMs against adversarial visual attacks and generalizes to unknown attack strategies on VQA and captioning tasks.

CoDefend

Novel technique introduced


Multimodal Large Language Models (MLLMs) have achieved remarkable success in tasks such as image captioning, visual question answering, and cross-modal reasoning by integrating visual and textual modalities. However, their multimodal nature also exposes them to adversarial threats, where attackers can perturb either modality or both jointly to induce harmful, misleading, or policy violating outputs. Existing defense strategies, such as adversarial training and input purification, face notable limitations: adversarial training typically improves robustness only against known attacks while incurring high computational costs, whereas conventional purification approaches often suffer from degraded image quality and insufficient generalization to complex multimodal tasks. In this work, we focus on defending the visual modality, which frequently serves as the primary entry point for adversarial manipulation. We propose a supervised diffusion based denoising framework that leverages paired adversarial clean image datasets to fine-tune diffusion models with directional, task specific guidance. Unlike prior unsupervised purification methods such as DiffPure, our approach achieves higher quality reconstructions while significantly improving defense robustness in multimodal tasks. Furthermore, we incorporate prompt optimization as a complementary defense mechanism, enhancing resistance against diverse and unseen attack strategies. Extensive experiments on image captioning and visual question answering demonstrate that our method not only substantially improves robustness but also exhibits strong transferability to unknown adversarial attacks. These results highlight the effectiveness of supervised diffusion based denoising for multimodal defense, paving the way for more reliable and secure deployment of MLLMs in real world applications.


Key Contributions

  • Supervised diffusion-based denoising framework that fine-tunes Stable Diffusion on paired adversarial-clean image datasets for task-specific, directional purification — outperforming unsupervised methods like DiffPure
  • Prompt optimization as a complementary textual-side defense to enhance resistance against diverse and unseen adversarial attack strategies
  • Demonstrated strong transferability to unknown adversarial attacks on image captioning and visual question answering benchmarks

🛡️ Threat Analysis

Input Manipulation Attack

The paper proposes a defense against adversarial image perturbations crafted to manipulate MLLM outputs at inference time — the core ML01 threat. The supervised diffusion-based denoising and prompt optimization are countermeasures against adversarial input manipulation targeting the visual modality.


Details

Domains
visionnlpmultimodal
Model Types
vlmdiffusionllmmultimodal
Threat Tags
inference_timedigitalwhite_boxblack_box
Applications
image captioningvisual question answeringmultimodal reasoning