A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models

Despite rapid progress in multimodal models and Large Vision-Language Models (LVLMs), these models remain highly susceptible to adversarial perturbations, raising serious concerns about their reliability in real-world use. While adversarial training has become the leading paradigm for building models that are robust to adversarial attacks, Test-Time Transformations (TTT) have emerged as a promising strategy to boost robustness at inference. In light of this, we propose Energy-Guided Test-Time Transformation (ET3), a lightweight, training-free defense that enhances robustness by minimizing the energy of the input samples. Our method is grounded in a theory that proves our transformation succeeds in classification under reasonable assumptions. We present extensive experiments demonstrating that ET3 provides a strong defense for classifiers, for zero-shot classification with CLIP, and for boosting the robustness of LVLMs in tasks such as Image Captioning and Visual Question Answering. Code is available at github.com/OmnAI-Lab/Energy-Guided-Test-Time-Defense.

Key Contributions

Energy-Guided Test-Time Transformation (ET3) defense with theoretical proof of classification success under reasonable assumptions
Training-free method that works across image classification, zero-shot CLIP classification, and LVLM tasks (captioning, VQA)
Demonstrates superiority over existing test-time defenses across multiple datasets and model architectures

🛡️ Threat Analysis

Input Manipulation Attack

Defends against adversarial perturbations at inference time by transforming inputs to minimize energy, with theoretical guarantees for correct classification.
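
To make the idea of "transforming inputs to minimize energy" concrete, here is a minimal sketch on a toy linear classifier. It is not the authors' implementation. Several things are assumptions made for illustration: the energy is taken to be the negative log-sum-exp of the logits (a common energy-based reading of a classifier), the update is signed gradient descent, and the result is kept inside an L-infinity ball of radius `eps` around the original input. The function names and the parameters `eps`, `step`, and `n_steps` are hypothetical; the paper's actual energy and update rule may differ.

```python
import math


def logits(W, b, x):
    """Logits of a toy linear classifier: z_k = W[k] . x + b[k]."""
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]


def energy(W, b, x):
    """Assumed energy: E(x) = -log sum_k exp(z_k), computed stably."""
    z = logits(W, b, x)
    m = max(z)
    return -(m + math.log(sum(math.exp(v - m) for v in z)))


def energy_grad(W, b, x):
    """Analytic gradient of E w.r.t. x: dE/dx_j = -sum_k softmax(z)_k * W[k][j]."""
    z = logits(W, b, x)
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    probs = [e / total for e in exps]
    return [-sum(p * row[j] for p, row in zip(probs, W)) for j in range(len(x))]


def sign(v):
    """Sign of v as -1.0, 0.0, or 1.0."""
    return (v > 0) - (v < 0) + 0.0


def energy_guided_transform(W, b, x, eps=0.1, step=0.02, n_steps=10):
    """Lower the (assumed) energy of x with signed gradient steps,
    clipping each coordinate to stay within eps of the original input."""
    x0 = list(x)
    cur = list(x)
    for _ in range(n_steps):
        g = energy_grad(W, b, cur)
        cur = [c - step * sign(gj) for c, gj in zip(cur, g)]
        cur = [min(max(c, o - eps), o + eps) for c, o in zip(cur, x0)]
    return cur


def predict(W, b, x):
    """Index of the largest logit."""
    z = logits(W, b, x)
    return max(range(len(z)), key=lambda k: z[k])


# Toy usage: a 2-class linear model and a 2-dimensional input.
W = [[1.0, -1.0], [-1.0, 1.0]]
b = [0.0, 0.0]
x = [0.2, 0.1]
x_t = energy_guided_transform(W, b, x)
label = predict(W, b, x_t)
```

In this toy setting the energy is concave in the input, so each signed step cannot increase it, and the transformed input `x_t` has lower energy than `x` while staying within `eps` of it. For a deep network the gradient would come from automatic differentiation rather than a closed form, and no such monotonicity is implied by this sketch.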

Details

Domains

vision, multimodal, nlp

Model Types

vlm, cnn, transformer, multimodal

Threat Tags

inference_time, digital

Applications

2025, 0 citations
