Defense · 2025

Zero-Shot Robustness of Vision Language Models Via Confidence-Aware Weighting

Nikoo Naghavian, Mostafa Tavassolipour

0 citations · 57 references · arXiv


Published on arXiv: 2510.02913

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

CAW outperforms PMG-AFT and TGA-ZSR in robust accuracy under AutoAttack on TinyImageNet and 14 zero-shot datasets while using less memory than both baselines.

CAW (Confidence-Aware Weighting)

Novel technique introduced


Vision-language models like CLIP demonstrate impressive zero-shot generalization but remain highly vulnerable to adversarial attacks. In this work, we propose Confidence-Aware Weighting (CAW) to enhance zero-shot robustness in vision-language models. CAW consists of two components: (1) a Confidence-Aware loss that prioritizes uncertain adversarial examples by scaling the KL divergence between clean and adversarial predictions, and (2) a feature alignment regularization that preserves semantic consistency by minimizing the distance between frozen and fine-tuned image encoder features on adversarial inputs. These components work jointly to improve both clean and robust accuracy without sacrificing generalization. Extensive experiments on TinyImageNet and 14 additional datasets show that CAW outperforms recent methods such as PMG-AFT and TGA-ZSR under strong attacks like AutoAttack, while using less memory.


Key Contributions

  • Confidence-Aware loss that up-weights uncertain/hard adversarial examples by scaling KL divergence between clean and adversarial prediction distributions
  • Feature alignment regularization that minimizes distance between frozen and fine-tuned CLIP image encoder features on adversarial inputs to preserve semantic knowledge
  • CAW achieves state-of-the-art zero-shot robust accuracy on TinyImageNet and 14 datasets under AutoAttack while requiring less memory than PMG-AFT and TGA-ZSR
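The Confidence-Aware loss above can be illustrated with a minimal sketch. The exact weighting function used by the paper is not specified in this summary, so the 1/confidence weight below is an illustrative assumption; the feature-alignment term is likewise shown as a simple squared distance between frozen and fine-tuned encoder features:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_div(p, q, eps=1e-12):
    """KL divergence KL(p || q) between two probability vectors."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def caw_loss(clean_logits, adv_logits, frozen_feat, tuned_feat, lam=1.0):
    """Sketch of the CAW objective's two components:
    (1) KL between clean and adversarial predictions, up-weighted
        when the adversarial prediction is uncertain (hypothetical
        weight: inverse of the adversarial max probability);
    (2) feature alignment: squared L2 distance between frozen and
        fine-tuned image-encoder features on the adversarial input."""
    p_clean = softmax(clean_logits)
    p_adv = softmax(adv_logits)
    weight = 1.0 / max(p_adv)  # uncertain adversarial examples get larger weight
    confidence_term = weight * kl_div(p_clean, p_adv)
    align_term = sum((f - t) ** 2 for f, t in zip(frozen_feat, tuned_feat))
    return confidence_term + lam * align_term
```

With identical clean and adversarial predictions and aligned features, the loss is zero; as the adversarial prediction drifts toward uniform (low confidence), both the KL term and its weight grow, which is the prioritization of hard examples described above.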

🛡️ Threat Analysis

Input Manipulation Attack

The paper's primary contribution is a defense against adversarial image perturbations (PGD, AutoAttack, CW) that cause misclassification in CLIP at inference time — the canonical ML01 threat of input manipulation / evasion attacks on image classifiers.
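The evasion attacks named above share a common shape: iterative gradient steps projected back into a small perturbation budget. A minimal PGD sketch under an L-infinity constraint (toy gradient oracle, not the paper's attack setup) looks like this:

```python
def pgd_attack(x, grad_fn, eps, alpha, steps):
    """Projected gradient descent under an L-infinity budget:
    take signed gradient steps of size alpha, then clip the
    perturbed input back into the eps-ball around the original.
    grad_fn(x_adv) returns the loss gradient w.r.t. the input."""
    x_adv = list(x)
    for _ in range(steps):
        g = grad_fn(x_adv)
        # signed ascent step on the loss
        x_adv = [xi + alpha * ((g_i > 0) - (g_i < 0))
                 for xi, g_i in zip(x_adv, g)]
        # project back into the eps-ball around the clean input
        x_adv = [min(max(xa, xo - eps), xo + eps)
                 for xa, xo in zip(x_adv, x)]
    return x_adv
```

AutoAttack ensembles several such attacks (including parameter-free PGD variants), which is why it is the standard stress test for the robust-accuracy numbers reported here.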


Details

Domains
vision, multimodal
Model Types
vlm, transformer
Threat Tags
white_box, inference_time, untargeted, digital
Datasets
TinyImageNet, ImageNet
Applications
image classification, zero-shot classification