Defense · 2025

Calibrating Uncertainty for Zero-Shot Adversarial CLIP

Wenjing Lu 1,2, Zerui Tao 1, Dongping Zhang 1,3, Yuning Qiu 1, Yang Yang 2, Qibin Zhao 1

0 citations · 73 references · arXiv

Published on arXiv · 2512.12997

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Restores calibrated predictive uncertainty under adversarial attack while maintaining competitive adversarial robustness and clean accuracy on multiple zero-shot classification benchmarks.

Dirichlet-based Uncertainty-Calibrated Adversarial Fine-Tuning

Novel technique introduced


CLIP delivers strong zero-shot classification but remains highly vulnerable to adversarial attacks. Previous work on adversarial fine-tuning largely focuses on matching the predicted logits between clean and adversarial examples, which overlooks uncertainty calibration and may degrade zero-shot generalization. A common expectation in reliable uncertainty estimation is that predictive uncertainty should increase as inputs become more difficult or shift away from the training distribution. However, we frequently observe the opposite in the adversarial setting: perturbations not only degrade accuracy but also suppress uncertainty, leading to severe miscalibration and unreliable over-confidence. This overlooked phenomenon highlights a critical reliability gap beyond robustness. To bridge this gap, we propose a novel adversarial fine-tuning objective for CLIP that accounts for both prediction accuracy and uncertainty alignment. By reparameterizing the output of CLIP as the concentration parameters of a Dirichlet distribution, we obtain a unified representation that captures both relative semantic structure and the magnitude of predictive confidence. Our objective aligns these distributions holistically under perturbations, moving beyond single-logit anchoring and restoring calibrated uncertainty. Experiments on multiple zero-shot classification benchmarks demonstrate that our approach effectively restores calibrated uncertainty and achieves competitive adversarial robustness while maintaining clean accuracy.
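The core idea — mapping CLIP's real-valued logits to Dirichlet concentration parameters and aligning the clean and adversarial Dirichlet distributions — can be sketched as below. This is a minimal illustration, not the paper's exact objective: the `exp` mapping to positive concentrations and the use of a closed-form Dirichlet KL divergence as the alignment term are assumptions for the sketch (the paper may use a different positive mapping or alignment loss).

```python
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_concentrations(logits):
    """Map real-valued per-class logits to positive Dirichlet
    concentration parameters. `exp` is one common choice
    (softplus is another); this is an assumption of the sketch."""
    return np.exp(np.asarray(logits, dtype=float))

def dirichlet_kl(alpha, beta):
    """Closed-form KL(Dir(alpha) || Dir(beta)).

    The mean alpha / alpha.sum() carries the relative semantic
    structure across classes, while the total concentration
    alpha.sum() carries the magnitude of predictive confidence,
    so this single term aligns both at once."""
    a0, b0 = alpha.sum(), beta.sum()
    return (gammaln(a0) - gammaln(alpha).sum()
            - gammaln(b0) + gammaln(beta).sum()
            + np.dot(alpha - beta, digamma(alpha) - digamma(a0)))

# Illustrative alignment term between clean and adversarial logits:
clean_logits = np.array([2.0, 0.5, -1.0])
adv_logits = np.array([0.1, 0.4, 0.2])   # flatter, suppressed confidence
align_loss = dirichlet_kl(dirichlet_concentrations(clean_logits),
                          dirichlet_concentrations(adv_logits))
```

Unlike single-logit anchoring, penalizing this divergence couples the whole predictive distribution: two logit vectors with the same argmax but different spread or scale still incur a nonzero alignment loss.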


Key Contributions

  • Identifies and formalizes a previously overlooked reliability gap: adversarial perturbations suppress predictive uncertainty in CLIP, causing severe miscalibration and overconfidence beyond accuracy degradation.
  • Proposes reparameterizing CLIP output logits as Dirichlet distribution concentration parameters to jointly capture inter-class semantic structure and confidence magnitude.
  • Novel adversarial fine-tuning objective that holistically aligns Dirichlet distributions between clean and adversarial inputs, restoring calibrated uncertainty while achieving competitive zero-shot adversarial robustness.

🛡️ Threat Analysis

Input Manipulation Attack

Primary contribution is a defense against adversarial input perturbations targeting CLIP's zero-shot classification — adversarial fine-tuning that addresses both accuracy degradation and the newly identified uncertainty suppression caused by adversarial examples at inference time.


Details

Domains
vision, multimodal
Model Types
vlm, transformer
Threat Tags
white_box, inference_time, digital
Datasets
ImageNet, CIFAR-10, CIFAR-100, Oxford Pets, Caltech-101
Applications
zero-shot image classification