Defense · 2025

Pay Less Attention to Function Words for Free Robustness of Vision-Language Models

Qiwei Tian , Chenhao Lin , Zhengyu Zhao , Chao Shen



Published on arXiv · 2512.07222

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

FDA reduces attack success rate (ASR) by 18–90% on retrieval and visual grounding tasks, with at most 0.6% performance degradation, across three tested VLM architectures under six different adversarial attacks.

FDA (Function-word De-Attention)

Novel technique introduced


To address the trade-off between robustness and performance in robust VLMs, we observe that function words can introduce vulnerability of VLMs to cross-modal adversarial attacks, and accordingly propose Function-word De-Attention (FDA) to mitigate their impact. Analogous to a differential amplifier, FDA computes both the original cross-attention and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former to yield better-aligned, more robust VLMs. Comprehensive experiments cover 2 SOTA baselines under 6 different attacks, on 2 downstream tasks, 3 datasets, and 3 models. Overall, FDA yields an average 18/13/53% ASR drop with only 0.2/0.3/0.6% performance drops on the 3 tested models for retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We experimentally demonstrate the scalability, generalization, and zero-shot performance of FDA, and provide in-depth ablation studies and analysis. Code will be made publicly available at https://github.com/michaeltian108/FDA.


Key Contributions

  • Identifies function words as a previously overlooked source of VLM vulnerability to cross-modal adversarial attacks
  • Proposes FDA (Function-word De-Attention), a training-free differential attention mechanism that subtracts function-word cross-attention from full cross-attention, analogous to a differential amplifier
  • Demonstrates 18–90% ASR reduction across retrieval and visual grounding tasks on 3 VLM architectures with at most 0.6% performance drop, and shows zero-shot generalization
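The differential mechanism above can be sketched in a few lines. This is a minimal, single-head numpy illustration, not the authors' implementation: the subtraction weight `lam`, the clip-and-renormalize step, and the function `fda_cross_attention` itself are hypothetical simplifications of the "subtract function-word cross-attention from full cross-attention" idea.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax; exp(-inf) rows of zeros are allowed."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fda_cross_attention(q, k, v, func_mask, lam=1.0):
    """Differential 'de-attention' sketch.

    q:         (n_img, d) image-side queries
    k, v:      (n_txt, d) text-side keys / values
    func_mask: (n_txt,) bool, True at function-word token positions
    lam:       subtraction weight (hypothetical knob, not from the paper)
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (n_img, n_txt)
    attn_full = softmax(scores, axis=-1)               # original cross-attention
    # Cross-attention restricted to function-word tokens only.
    masked = np.where(func_mask[None, :], scores, -np.inf)
    attn_func = softmax(masked, axis=-1)
    # Differentially subtract, like a differential amplifier, then
    # clip to non-negative weights and renormalize.
    diff = np.clip(attn_full - lam * attn_func, 0.0, None)
    diff = diff / np.clip(diff.sum(axis=-1, keepdims=True), 1e-9, None)
    return diff @ v
```

With `lam=1.0`, weight on function-word positions is largely cancelled, so the output is dominated by content-word values, which is the intended robustness effect.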

🛡️ Threat Analysis

Input Manipulation Attack

Defends against cross-modal adversarial attacks on VLMs that manipulate inputs (image or text) at inference time to degrade retrieval and grounding performance. It directly addresses adversarial example attacks and proposes a novel defense via modified attention.


Details

Domains
multimodal · vision · nlp
Model Types
vlm · transformer
Threat Tags
white_box · inference_time · digital
Applications
image-text retrieval · visual grounding