An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs
Zhi Luo 1, Zenghui Yuan 1, Wenqi Wei 2, Daizong Liu 3, Pan Zhou 1
Published on arXiv (arXiv:2511.16163)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Model Denial of Service
OWASP LLM Top 10 — LLM04
Key Finding
VTIA achieves superior stability and token-length maximization compared to prior EOS-suppression attacks across four popular VLMs, with imperceptible image perturbations and no need to invoke the target LLM during iterative optimization.
VTIA (Verbose-Text Induction Attack)
Novel technique introduced
With the remarkable success of Vision-Language Models (VLMs) on multimodal tasks, concerns regarding their deployment efficiency have become increasingly prominent. In particular, the number of tokens consumed during generation has emerged as a key evaluation metric. Prior studies have shown that specific inputs can induce VLMs to generate lengthy outputs with low information density, significantly increasing energy consumption, latency, and token costs. However, existing methods merely delay the end-of-sequence (EOS) token to implicitly prolong output; they fail to directly maximize output token length as an explicit optimization objective, and therefore lack stability and controllability.

To address these limitations, this paper proposes a novel verbose-text induction attack (VTIA) that injects imperceptible adversarial perturbations into benign images via a two-stage framework, identifying the most malicious prompt embeddings for maximizing the output token count of the perturbed images. Specifically, we first perform adversarial prompt search, employing reinforcement learning to automatically identify adversarial prompts that induce the LLM component within VLMs to produce verbose outputs. We then conduct vision-aligned perturbation optimization, crafting adversarial examples on input images that maximize the similarity between the perturbed image's visual embeddings and those of the adversarial prompt, thereby constructing malicious images that trigger verbose text generation. Comprehensive experiments on four popular VLMs demonstrate that our method achieves significant advantages in effectiveness, efficiency, and generalization capability.
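The first stage described above, adversarial prompt search, can be sketched as a search loop whose reward is output verbosity. The paper uses reinforcement learning against the VLM's LLM component; the sketch below substitutes a simpler greedy hill-climbing strategy and a toy deterministic scorer in place of real model queries, so every name here (`output_length`, the vocabulary size, the mutation scheme) is a hypothetical stand-in rather than the paper's actual setup.

```python
import random

def output_length(prompt_tokens):
    # Stand-in for querying the VLM's LLM component. In the real attack
    # the reward is the number of tokens the model actually generates;
    # here a toy deterministic scorer keeps the sketch self-contained.
    return sum(t % 7 for t in prompt_tokens)

def search_adversarial_prompt(vocab_size=50, prompt_len=8, iters=200, seed=0):
    """Greedy hill-climbing stand-in for RL-based prompt search:
    mutate one token at a time and keep the mutation whenever the
    (toy) verbosity reward increases."""
    rng = random.Random(seed)
    prompt = [rng.randrange(vocab_size) for _ in range(prompt_len)]
    best = output_length(prompt)
    for _ in range(iters):
        pos = rng.randrange(prompt_len)
        candidate = prompt[:]
        candidate[pos] = rng.randrange(vocab_size)
        score = output_length(candidate)
        if score > best:          # accept only reward-improving mutations
            prompt, best = candidate, score
    return prompt, best
```

The key structural point mirrored from the paper is that the search objective is the output length itself, made explicit, rather than an indirect EOS-suppression proxy.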
Key Contributions
- Two-stage VTIA framework: (1) RL-based adversarial prompt search to identify malicious prompt embeddings that induce verbose LLM outputs, and (2) vision-aligned perturbation optimization that aligns perturbed image embeddings with adversarial prompt embeddings without querying the target LLM during optimization.
- Explicit maximization of output token length as an optimization objective, overcoming the instability of prior EOS-suppression methods.
- Demonstrated effectiveness, efficiency, and transferability across four popular VLMs.
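The second stage, vision-aligned perturbation optimization, amounts to an L-infinity-bounded adversarial-example loop that pushes the image's visual embedding toward the adversarial prompt embedding. The sketch below is a minimal illustration assuming a toy linear image encoder (`W @ x`) and hand-picked `eps`/`step` values; the paper's actual encoder, budget, and optimizer are not specified here, and the closed-form cosine gradient only holds for this linear stand-in.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def align_image_to_prompt(image, W, target_emb, eps=8/255, step=1/255, iters=100):
    """Sign-gradient ascent on cosine similarity between the perturbed
    image's (toy linear) visual embedding W @ x and the adversarial
    prompt embedding, under an L-infinity perturbation budget eps."""
    x = image.copy()
    for _ in range(iters):
        emb = W @ x
        n_e, n_t = np.linalg.norm(emb), np.linalg.norm(target_emb)
        # Gradient of cos(emb, target) w.r.t. emb, then chain through W.
        g_emb = target_emb / (n_e * n_t) - (emb @ target_emb) * emb / (n_e**3 * n_t)
        g_x = W.T @ g_emb
        x = x + step * np.sign(g_x)
        x = np.clip(x, image - eps, image + eps)  # imperceptibility constraint
        x = np.clip(x, 0.0, 1.0)                  # stay a valid image
    return x
```

Note that this stage never invokes the target LLM: only the visual encoder is in the optimization loop, which is the efficiency property the Key Finding above highlights.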
🛡️ Threat Analysis
VTIA crafts imperceptible adversarial perturbations on input images via gradient-based optimization, targeting VLMs at inference time. It is a direct adversarial input manipulation attack on vision-language models.