defense 2026

AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models

Yubo Cui 1, Xianchao Guan 1, Zijun Xiong 1, Zheng Zhang 1,2

Published on arXiv: 2603.29410

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Outperforms state-of-the-art methods on 15 zero-shot benchmarks while preserving original model capabilities and significantly improving adversarial robustness

AGFT

Novel technique introduced


Pre-trained vision-language models (VLMs) exhibit strong zero-shot generalization but remain vulnerable to adversarial perturbations. Existing classification-guided adversarial fine-tuning methods often disrupt pre-trained cross-modal alignment, weakening visual-textual correspondence and degrading zero-shot performance. In this paper, we propose an Alignment-Guided Fine-Tuning (AGFT) framework that enhances zero-shot adversarial robustness while preserving the cross-modal semantic structure. Unlike label-based methods, which rely on hard labels and fail to maintain the relative relationships between images and text, AGFT leverages the probabilistic predictions of the original model for text-guided adversarial training, aligning adversarial visual features with textual embeddings through soft alignment distributions. To address structural discrepancies introduced by fine-tuning, we introduce a distribution consistency calibration mechanism that adjusts the robust model output to match a temperature-scaled version of the pre-trained model's predictions. Extensive experiments across multiple zero-shot benchmarks demonstrate that AGFT outperforms state-of-the-art methods, significantly improving zero-shot adversarial robustness while preserving the original model's zero-shot capabilities.
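The abstract's core idea — training on soft alignment distributions from the frozen pre-trained model rather than hard labels, with a temperature-scaled consistency target — can be sketched as a single loss term. This is a framework-agnostic NumPy sketch; the function names, temperatures, and exact loss composition are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def alignment_guided_loss(adv_img_feats, txt_feats, clean_img_feats,
                          tau=0.07, calib_tau=2.0):
    """KL divergence between the fine-tuned model's image-text similarity
    distribution on adversarial inputs and a temperature-scaled version of
    the frozen pre-trained model's distribution on clean inputs, so the
    relative image-text relationships (soft alignment) are preserved.
    tau and calib_tau are assumed hyperparameters, not the paper's values."""
    # CLIP-style cosine-similarity logits over the candidate text prompts
    adv_logits = normalize(adv_img_feats) @ normalize(txt_feats).T / tau
    clean_logits = normalize(clean_img_feats) @ normalize(txt_feats).T / tau
    # Soft target: temperature-scaled probabilistic predictions (no hard labels)
    target = softmax(clean_logits / calib_tau)
    # Numerically stable log-softmax for the robust model's prediction
    z = adv_logits - adv_logits.max(-1, keepdims=True)
    log_pred = z - np.log(np.exp(z).sum(-1, keepdims=True))
    # KL(target || prediction), averaged over the batch
    return float((target * (np.log(target) - log_pred)).sum(-1).mean())
```

Because the target is a full distribution over text prompts rather than a one-hot label, minimizing this loss pulls adversarial visual features toward the pre-trained model's semantic structure instead of a single class direction.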


Key Contributions

  • Alignment-Guided Fine-Tuning (AGFT) framework that enhances zero-shot adversarial robustness while preserving cross-modal semantic structure
  • Text-guided adversarial training using soft probabilistic predictions instead of hard labels to maintain visual-textual correspondence
  • Distribution consistency calibration mechanism that adjusts robust model outputs to match temperature-scaled pre-trained model predictions

🛡️ Threat Analysis

Input Manipulation Attack

Defends against adversarial perturbations that cause misclassification in vision-language models. The paper addresses adversarial examples at inference time and proposes adversarial training to improve robustness against gradient-based attacks like PGD and AutoAttack.
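To make the threat concrete, the following is a minimal sketch of the L-infinity PGD attack named above, run against a toy linear classifier with an analytic gradient. The linear model and all hyperparameters are illustrative assumptions; attacking a real VLM would require backpropagating through its image encoder.

```python
import numpy as np

def softmax(x):
    z = x - x.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def pgd_attack(x, y, W, eps=8 / 255, alpha=2 / 255, steps=10):
    """L-inf PGD: repeatedly step in the sign of the loss gradient, then
    project back into the eps-ball around the clean input x and into the
    valid pixel range [0, 1]."""
    x_adv = x.copy()
    onehot = np.eye(W.shape[1])[y]
    for _ in range(steps):
        p = softmax(x_adv @ W)                    # class probabilities
        grad = (p - onehot) @ W.T                 # d(cross-entropy)/dx for linear logits
        x_adv = x_adv + alpha * np.sign(grad)     # ascent step on the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep valid pixel range
    return x_adv
```

Adversarial training of the kind the paper proposes generates such perturbed inputs at each training step and optimizes the model on them, which is what the alignment-guided loss above is applied to.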


Details

Domains
vision, nlp, multimodal
Model Types
vlm, multimodal, transformer
Threat Tags
inference_time, digital
Applications
image classification, zero-shot learning, vision-language understanding