survey 2025

Navigating the Trade-off: A Synthesis of Defensive Strategies for Zero-Shot Adversarial Robustness in Vision-Language Models

Zane Xu¹, Jason Sun²



Published on arXiv: 2508.05237

Input Manipulation Attack (OWASP ML Top 10 — ML01)

Key Finding

Standard adversarial training severs vision-language alignment in CLIP, motivating two defense paradigms; within them, embedding-space re-engineering (LAAT, TIMA) and latent-space purification (CLIPure) represent the most principled advances toward resolving the robustness-generalization trade-off.


This report synthesizes eight seminal papers on the zero-shot adversarial robustness of vision-language models (VLMs) such as CLIP. A central challenge in this domain is the inherent trade-off between enhancing adversarial robustness and preserving the model's zero-shot generalization capabilities. We analyze two primary defense paradigms: Adversarial Fine-Tuning (AFT), which modifies model parameters, and Training-Free/Test-Time Defenses, which preserve them. We trace the evolution from alignment-preserving methods (TeCoA) to embedding-space re-engineering (LAAT, TIMA), and from input heuristics (AOM, TTC) to latent-space purification (CLIPure). Finally, we identify key challenges and future directions, including hybrid defense strategies and adversarial pre-training.
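The tension described above stems from the standard adversarial training objective, which replaces clean training samples with worst-case perturbations. As a generic sketch (not any one surveyed paper's exact loss):

```latex
\min_{\theta} \; \mathbb{E}_{(x,y)}\left[\; \max_{\|\delta\|_{\infty} \le \epsilon} \mathcal{L}\big(f_{\theta}(x+\delta),\, y\big) \right]
```

Solving the inner maximization over a fixed label set is what pulls the vision encoder away from its pre-trained alignment; per the survey, TeCoA-style AFT mitigates this by keeping the loss tied to CLIP's text embeddings rather than a fixed classification head.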


Key Contributions

  • Synthesizes and categorizes eight papers on zero-shot adversarial robustness for VLMs into two paradigms: Adversarial Fine-Tuning (TeCoA, PMG-AFT, LAAT, TIMA, TGA-ZSR) and Training-Free/Test-Time Defenses (AOM, TTC, CLIPure)
  • Distills the core robustness-generalization dilemma showing that standard adversarial training catastrophically destroys CLIP's zero-shot capabilities across 15 unseen datasets
  • Identifies future directions including hybrid defense models and large-scale adversarial pre-training for natively robust foundation models
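To make the latent-space purification idea concrete, here is a minimal, self-contained sketch. It assumes a Gaussian prior over clean embeddings purely as a toy stand-in (CLIPure's actual likelihood model differs): a perturbed embedding is pulled back toward the clean-embedding distribution by gradient ascent on the log-prior.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in: model "clean CLIP embeddings" with a diagonal Gaussian prior
# fitted on synthetic clean data (an assumption for this sketch only).
D = 8
clean = rng.standard_normal((500, D)) * 0.5 + 2.0   # synthetic clean embeddings
mu = clean.mean(axis=0)
var = clean.var(axis=0) + 1e-6

def log_prior_grad(z):
    # d/dz log N(z; mu, diag(var)) = -(z - mu) / var
    return -(z - mu) / var

def purify(z, steps=50, lr=0.05):
    """Gradient ascent on the (toy) log-prior, pulling a perturbed
    embedding back toward the clean-embedding distribution."""
    z = z.copy()
    for _ in range(steps):
        z += lr * log_prior_grad(z)
    return z

z_clean = mu.copy()
z_adv = z_clean + rng.standard_normal(D) * 1.5      # simulated attack shift
z_pure = purify(z_adv)
```

Because purification acts only on the embedding, the pre-trained encoder weights stay untouched, which is why the survey groups it with the training-free defenses.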

🛡️ Threat Analysis

Input Manipulation Attack

The surveyed papers all target adversarial input perturbation attacks on VLMs at inference time — gradient-based perturbations that cause misclassification in zero-shot settings. Both the attack threat and the defenses (adversarial fine-tuning, latent-space purification, test-time correction) directly address ML01.
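A minimal sketch of this gradient-based perturbation threat on a toy zero-shot classifier; the linear "image encoder" and random "text embeddings" below are illustrative stand-ins, not CLIP, and the one-step sign attack is FGSM-style rather than any surveyed paper's exact attack.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy zero-shot classifier: fixed linear "image encoder", unit-norm
# "text embeddings" (both hypothetical stand-ins for CLIP components).
D_IN, D_EMB, N_CLASSES = 16, 8, 3
W = rng.standard_normal((D_EMB, D_IN))
text_emb = rng.standard_normal((N_CLASSES, D_EMB))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

def scores(x):
    """Cosine similarity between the encoded image and each text embedding."""
    z = W @ x
    return text_emb @ (z / np.linalg.norm(z))

def fgsm(x, label, eps=0.1):
    """One-step sign-gradient perturbation that increases the loss
    -scores(x)[label], pushing x away from its predicted class.
    The gradient is taken by central finite differences to keep
    the sketch dependency-free."""
    loss = lambda v: -scores(v)[label]
    g = np.zeros_like(x)
    h = 1e-5
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (loss(x + e) - loss(x - e)) / (2 * h)
    return x + eps * np.sign(g)

x = rng.standard_normal(D_IN)
pred = int(np.argmax(scores(x)))   # clean zero-shot prediction
x_adv = fgsm(x, pred)              # similarity to the predicted class drops
```

The defenses surveyed intervene at different points in this pipeline: AFT changes `W`, purification corrects the embedding `z`, and test-time correction adjusts `x` or the scores.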


Details

Domains: vision, multimodal
Model Types: vlm, transformer
Threat Tags: white_box, inference_time, digital
Datasets: ImageNet
Applications: image classification, zero-shot learning, vision-language models