
Less Is More -- Until It Breaks: Security Pitfalls of Vision Token Compression in Large Vision-Language Models

Xiaomei Zhang 1, Zhaoxi Zhang 2, Leo Yu Zhang 1, Yanjun Zhang 1, Guanhong Tao 3, Shirui Pan 1

Published on arXiv · 2601.12042 · 64 references

Input Manipulation Attack (OWASP ML Top 10 — ML01)

Prompt Injection (OWASP LLM Top 10 — LLM01)

Key Finding

Visual token compression enables adversarial attacks that are completely invisible under uncompressed inference but cause consistent and severe model failures under compression, exposing a fundamental efficiency–security trade-off in LVLM deployment.

CAA (Compression-Aware Attack)

Novel technique introduced


Visual token compression is widely adopted to improve the inference efficiency of Large Vision-Language Models (LVLMs), enabling their deployment in latency-sensitive and resource-constrained scenarios. However, existing work has mainly focused on efficiency and performance, while the security implications of visual token compression remain largely unexplored. In this work, we first reveal that visual token compression substantially degrades the robustness of LVLMs: models that are robust under uncompressed inference become highly vulnerable once compression is enabled. These vulnerabilities are state-specific; failure modes emerge only in the compressed setting and disappear entirely when compression is disabled, making them particularly stealthy and difficult to diagnose. By analyzing the key stages of the compression process, we identify instability in token importance ranking as the primary cause of this robustness degradation: small, imperceptible perturbations can significantly alter token rankings, leading the compression mechanism to mistakenly discard task-critical information and ultimately causing model failure. Motivated by this observation, we propose a Compression-Aware Attack (CAA) to systematically study and exploit this vulnerability. CAA directly targets the token selection mechanism and induces failures exclusively under compressed inference. We further extend this approach to more realistic black-box settings and introduce Transfer CAA (T-CAA), where neither the target model nor the compression configuration is accessible. We also evaluate potential defenses and find that they provide only limited protection. Extensive experiments across models, datasets, and compression methods show that visual token compression significantly undermines robustness, revealing a previously overlooked efficiency–security trade-off.


Key Contributions

  • Reveals that visual token compression substantially degrades LVLM robustness, creating hidden, compression-specific failure modes that disappear when compression is disabled
  • Identifies instability in token importance ranking as the root cause, showing imperceptible perturbations can redirect compression to discard task-critical tokens
  • Proposes Compression-Aware Attack (CAA) and Transfer CAA (T-CAA) that exploit this vulnerability in both white-box and realistic black-box settings, with evaluations showing limited effectiveness of candidate defenses

🛡️ Threat Analysis

Input Manipulation Attack

CAA crafts imperceptible adversarial perturbations to the visual input that manipulate token importance rankings, causing the compression mechanism to discard task-critical tokens and induce model failure: a gradient-based adversarial input attack on LVLMs at inference time.
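The core mechanism, ranking instability in top-k token selection, can be illustrated with a minimal, self-contained Python sketch. The dot-product importance score and the greedy L∞-bounded search below are illustrative stand-ins (not the paper's method): real compression modules typically rank tokens by attention-derived scores, and CAA optimizes the perturbation with gradients. The sketch only shows how a small, budget-bounded change can evict a task-critical token from the kept set.

```python
def importance_scores(tokens, query):
    # Toy token-importance: dot product with a query vector. A stand-in for
    # the attention-based scores that token-compression modules use to rank
    # visual tokens before pruning or merging.
    return [sum(f * q for f, q in zip(tok, query)) for tok in tokens]

def top_k_keep(scores, k):
    # Indices of the k highest-scoring tokens -- the set the compressor keeps.
    return set(sorted(range(len(scores)), key=lambda i: -scores[i])[:k])

def compression_aware_attack(tokens, query, k, critical, eps=0.05, step=0.01, iters=50):
    # Greedy, L-inf-bounded search (a stand-in for gradient-based
    # optimization): lower the critical token's score, one small step per
    # feature, until it falls out of the top-k kept set.
    adv = [list(t) for t in tokens]
    for _ in range(iters):
        if critical not in top_k_keep(importance_scores(adv, query), k):
            break  # critical token evicted: compression will now drop it
        for j in range(len(query)):
            # Nudge feature j against the query direction, within the budget.
            cand = adv[critical][j] - step * (1.0 if query[j] > 0 else -1.0)
            if abs(cand - tokens[critical][j]) <= eps + 1e-9:
                adv[critical][j] = cand
    return adv
```

In this toy setup, the unperturbed input keeps the critical token, while a perturbation bounded by eps pushes its score just below the top-k threshold, so compression silently drops it; an uncompressed model, which sees all tokens, is unaffected. That asymmetry is exactly the state-specific failure mode described above.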


Details

Domains
vision, nlp, multimodal
Model Types
vlm, transformer
Threat Tags
white_box, black_box, inference_time, digital
Datasets
MME, SeedBench, VQAv2, MMBench
Applications
visual question answering, large vision-language model inference, autonomous driving perception