defense 2026

A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models

Mujtaba Hussain Mirza 1, Antonio D'Orazio 2, Odelia Melamed 1, Iacopo Masi 1

Published on arXiv

2603.26984

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Provides a provable test-time defense against adversarial attacks for LVLMs and classifiers without requiring model retraining

ET3 (Energy-Guided Test-Time Transformation)

Novel technique introduced


Despite rapid progress in multimodal models and Large Vision-Language Models (LVLMs), these models remain highly susceptible to adversarial perturbations, raising serious concerns about their reliability in real-world use. While adversarial training has become the leading paradigm for building models that are robust to adversarial attacks, Test-Time Transformations (TTT) have emerged as a promising strategy to boost robustness at inference. In light of this, we propose Energy-Guided Test-Time Transformation (ET3), a lightweight, training-free defense that enhances robustness by minimizing the energy of the input samples. Our method is grounded in a theory that proves our transformation succeeds in classification under reasonable assumptions. We present extensive experiments demonstrating that ET3 provides a strong defense for classifiers and for zero-shot classification with CLIP, and that it boosts the robustness of LVLMs in tasks such as Image Captioning and Visual Question Answering. Code is available at github.com/OmnAI-Lab/Energy-Guided-Test-Time-Defense.


Key Contributions

  • Energy-Guided Test-Time Transformation (ET3) defense with a theoretical proof of classification success under reasonable assumptions
  • Training-free method that works across image classification, zero-shot CLIP classification, and LVLM tasks (captioning, VQA)
  • Demonstrates superiority over existing test-time defenses across multiple datasets and model architectures

🛡️ Threat Analysis

Input Manipulation Attack

Defends against adversarial perturbations at inference time by transforming inputs to minimize energy, with theoretical guarantees for correct classification.
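
To make the idea of "transforming inputs to minimize energy" concrete, here is a minimal sketch under stated assumptions. It is not the ET3 algorithm from the paper, whose details are not given on this page. The sketch assumes the energy of an input is the negative log-sum-exp of a classifier's logits (a common energy-based reading of classifiers), uses a toy linear classifier, and takes signed gradient steps on the input while staying inside a small L-infinity ball. The names `energy_guided_transform`, `eps`, `step`, and `n_steps` are all hypothetical.

```python
import math


def logits(W, b, x):
    """Logits of a toy linear classifier: z_k = W[k] . x + b[k]."""
    return [sum(w * v for w, v in zip(row, x)) + b_k for row, b_k in zip(W, b)]


def energy(W, b, x):
    """Assumed energy: E(x) = -log sum_k exp(z_k(x)), computed stably."""
    z = logits(W, b, x)
    m = max(z)
    return -(m + math.log(sum(math.exp(v - m) for v in z)))


def energy_grad(W, b, x):
    """Analytic gradient of E w.r.t. x: dE/dx_j = -sum_k softmax_k * W[k][j]."""
    z = logits(W, b, x)
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    probs = [e / total for e in exps]
    return [-sum(p * row[j] for p, row in zip(probs, W)) for j in range(len(x))]


def sign(v):
    return (v > 0) - (v < 0)


def energy_guided_transform(W, b, x, eps=0.1, step=0.02, n_steps=10):
    """Lower the energy of x with signed gradient steps.

    Hypothetical illustration only: after every step the result is clipped
    so that each coordinate stays within eps of the original input.
    """
    original = list(x)
    current = list(x)
    for _ in range(n_steps):
        grad = energy_grad(W, b, current)
        # Move against the gradient to decrease the energy.
        current = [c - step * sign(g) for c, g in zip(current, grad)]
        # Project back onto the L-infinity ball around the original input.
        current = [min(max(c, o - eps), o + eps) for c, o in zip(current, original)]
    return current


if __name__ == "__main__":
    W = [[1.0, -1.0], [-1.0, 1.0]]  # toy two-class linear classifier
    b = [0.0, 0.0]
    x = [0.2, 0.1]
    x_t = energy_guided_transform(W, b, x)
    print("energy before:", energy(W, b, x))
    print("energy after: ", energy(W, b, x_t))
```

For a real network the gradient would come from automatic differentiation instead of the closed form used here. The clipping radius is a design choice of this sketch that keeps the transformed input close to the original.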


Details

Domains
vision, multimodal, nlp
Model Types
vlm, cnn, transformer, multimodal
Threat Tags
inference_time, digital
Applications
image classification, zero-shot classification, image captioning, visual question answering