Defense · 2025

Self-Calibrated Consistency can Fight Back for Adversarial Robustness in Vision-Language Models

Jiaxiang Liu 1, Jiawei Du 2, Xiao Liu 1, Prayag Tiwari 3, Mingkun Xu 1

1 citation · 53 references · arXiv


Published on arXiv · 2510.22785

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

SCC consistently improves adversarial robustness of CLIP across 22 zero-shot benchmarks while preserving clean accuracy, outperforming state-of-the-art test-time defenses without requiring retraining or labeled data.

Self-Calibrated Consistency (SCC)

Novel technique introduced


Abstract

Pre-trained vision-language models (VLMs) such as CLIP have demonstrated strong zero-shot capabilities across diverse domains, yet remain highly vulnerable to adversarial perturbations that disrupt image-text alignment and compromise reliability. Existing defenses typically rely on adversarial fine-tuning with labeled data, limiting their applicability in zero-shot settings. In this work, we identify two key weaknesses of current CLIP adversarial attacks -- lack of semantic guidance and vulnerability to view variations -- collectively termed semantic and viewpoint fragility. To address these challenges, we propose Self-Calibrated Consistency (SCC), an effective test-time defense. SCC consists of two complementary modules: Semantic consistency, which leverages soft pseudo-labels from counterattack warm-up and multi-view predictions to regularize cross-modal alignment and separate the target embedding from confusable negatives; and Spatial consistency, aligning perturbed visual predictions via augmented views to stabilize inference under adversarial perturbations. Together, these modules form a plug-and-play inference strategy. Extensive experiments on 22 benchmarks under diverse attack settings show that SCC consistently improves the zero-shot robustness of CLIP while maintaining accuracy, and can be seamlessly integrated with other VLMs for further gains. These findings highlight the great potential of establishing an adversarially robust paradigm from CLIP, with implications extending to broader vision-language domains such as BioMedCLIP.
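The paper's implementation is not included in this excerpt. As a rough illustration of the spatial-consistency idea described above (aggregating CLIP-style predictions over augmented views to stabilize inference), here is a minimal NumPy sketch. The function name, the toy embeddings, and the temperature value are assumptions for illustration, not the authors' code:

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_consistency_predict(view_embeds, text_embeds, temp=0.07):
    """Aggregate CLIP-style zero-shot predictions over augmented views.

    view_embeds: (V, D) image embeddings, one per augmented view (hypothetical)
    text_embeds: (C, D) class-prompt text embeddings (hypothetical)
    Returns the view-averaged probability distribution over the C classes.
    """
    # L2-normalize so dot products are cosine similarities, as in CLIP
    v = view_embeds / np.linalg.norm(view_embeds, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    logits = v @ t.T / temp            # (V, C) per-view similarity logits
    probs = softmax(logits, axis=1)    # per-view class distributions
    return probs.mean(axis=0)          # consensus distribution across views
```

The intuition is that an adversarial perturbation crafted for one view rarely survives all augmentations, so the averaged distribution is a more stable prediction than any single view's.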


Key Contributions

  • Identifies three vulnerabilities in existing test-time defenses for VLMs: semantic drift, view sensitivity, and hard-negative dominance, providing theoretical analysis of each
  • Proposes Self-Calibrated Consistency (SCC), combining a Semantic consistency module (cross-modal alignment via soft pseudo-labels and multi-view predictions to repel hard negatives) with a Spatial consistency module (augmented-view agreement to stabilize perturbed predictions)
  • Plug-and-play test-time defense requiring no retraining or labeled data, validated on 22 zero-shot benchmarks across CLIP and derivatives including BioMedCLIP

🛡️ Threat Analysis

Input Manipulation Attack

The paper directly defends against adversarial image perturbations (inference-time input manipulation) that cause CLIP to misclassify by disrupting image-text alignment. SCC is a defense against adversarial examples targeting VLMs, using counterattack warm-up, multi-view consistency, and cross-modal alignment regularization — all standard adversarial robustness defense techniques.
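To make the "counterattack warm-up" notion concrete, the sketch below uses a toy linear scorer as a stand-in for CLIP's image-text similarity: before predicting, the (possibly attacked) input is nudged a few FGSM-style steps that increase the score of the model's own current top class, and a soft pseudo-label is read off the resulting logits. All names, the step size, and the linear model are hypothetical, not the paper's method:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def counterattack_warmup(x, W, eps=0.05, steps=3):
    """Toy counterattack warm-up (hypothetical stand-in for SCC's version).

    x: (D,) input feature vector, possibly adversarially perturbed
    W: (C, D) linear scorer, logits = W @ x, standing in for CLIP similarity
    Returns a soft pseudo-label (probability vector over C classes).
    """
    x = x.copy()  # do not mutate the caller's input
    for _ in range(steps):
        logits = W @ x
        c = int(np.argmax(logits))
        # for a linear scorer, the gradient of logits[c] w.r.t. x is W[c];
        # step in its sign direction to reinforce the current top class
        x = x + eps * np.sign(W[c])
    return softmax(W @ x)  # soft pseudo-label after warm-up
```

In SCC these soft pseudo-labels are then used to regularize cross-modal alignment at test time; no labels or retraining are involved, which is what makes the defense plug-and-play.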


Details

Domains
vision, multimodal
Model Types
vlm, transformer
Threat Tags
white_box, inference_time, digital
Datasets
22 zero-shot benchmarks (unspecified in provided excerpt)
Applications
zero-shot image classification, image-text retrieval, vision-language model inference