
On the Adversarial Robustness of Discrete Image Tokenizers

Rishika Bhagwatkar 1, Irina Rish 1, Nicolas Flammarion 2, Francesco Croce 3

0 citations · 41 references · arXiv (Cornell University)


Published on arXiv

2602.18252

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Unsupervised adversarial training of discrete image tokenizers substantially improves robustness to both unsupervised and supervised attacks while generalizing to unseen tasks without requiring labeled data.


Discrete image tokenizers encode visual inputs as sequences of tokens from a finite vocabulary and are gaining popularity in multimodal systems, including encoder-only, encoder-decoder, and decoder-only models. However, unlike CLIP encoders, their vulnerability to adversarial attacks has not been explored. To our knowledge, this is the first work to study this topic. We first formulate attacks that aim to perturb the features extracted by discrete tokenizers, and thus change the extracted tokens. These attacks are computationally efficient, application-agnostic, and effective across classification, multimodal retrieval, and captioning tasks. Second, to defend against this vulnerability, inspired by recent work on robust CLIP encoders, we fine-tune popular tokenizers with unsupervised adversarial training, keeping all other components frozen. Although unsupervised and task-agnostic, our approach significantly improves robustness to both unsupervised and end-to-end supervised attacks and generalizes well to unseen tasks and data. Unlike supervised adversarial training, it can leverage unlabeled images, making it more versatile. Overall, our work highlights the critical role of tokenizer robustness in downstream tasks and presents an important step in the development of safe multimodal foundation models.


Key Contributions

  • First systematic study of adversarial vulnerability in discrete image tokenizers, formulating computationally efficient, application-agnostic attacks that alter codebook token assignments
  • Unsupervised adversarial training defense that fine-tunes tokenizer encoders without labels, significantly improving robustness to both unsupervised and end-to-end supervised attacks
  • Demonstration that tokenizer robustness critically impacts downstream multimodal tasks (classification, retrieval, captioning) and generalizes to unseen tasks and data distributions
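The feature-perturbation attack in the first contribution can be sketched on a toy model: an ℓ∞-bounded PGD that maximizes the distance between clean and perturbed encoder features, which in turn can flip the nearest-codebook token assignment. All names, dimensions, and hyperparameters below are illustrative assumptions, not the paper's actual setup; real tokenizers use deep encoders and far larger codebooks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a discrete tokenizer: a linear encoder followed by a
# nearest-neighbour codebook lookup. (Illustrative only.)
D_IN, D_FEAT, K = 64, 16, 32
W = rng.normal(size=(D_FEAT, D_IN))      # frozen encoder weights
codebook = rng.normal(size=(K, D_FEAT))  # discrete token embeddings

def encode(x):
    return W @ x

def tokenize(x):
    # Assign the token whose embedding is closest to the encoded feature.
    f = encode(x)
    return int(np.argmin(np.sum((codebook - f) ** 2, axis=1)))

def pgd_feature_attack(x, eps=0.1, alpha=0.02, steps=20):
    """l_inf PGD maximizing ||E(x+d) - E(x)||^2, pushing the feature
    toward a different codebook cell. For this linear encoder the
    gradient w.r.t. d is 2 W^T W d, so d starts at a small random
    point to break the symmetry at d = 0."""
    f_clean = encode(x)
    d = rng.uniform(-eps, eps, size=x.shape)
    for _ in range(steps):
        grad = 2.0 * W.T @ (encode(x + d) - f_clean)
        d = np.clip(d + alpha * np.sign(grad), -eps, eps)
    return d

x = rng.uniform(0.0, 1.0, size=D_IN)
delta = pgd_feature_attack(x)
feature_shift = float(np.linalg.norm(encode(x + delta) - encode(x)))
token_clean, token_adv = tokenize(x), tokenize(x + delta)
```

Because the attack only targets the encoder's features, it needs no labels and no knowledge of the downstream task, which is what makes it application-agnostic.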

🛡️ Threat Analysis

Input Manipulation Attack

Proposes gradient-based adversarial perturbations that change tokens extracted by discrete image tokenizers at inference time, causing downstream task failures across classification, retrieval, and captioning; the defense is unsupervised adversarial training of the tokenizer — a canonical ML01 defense.
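The defense can be sketched as an unsupervised adversarial fine-tuning loop: an inner PGD attack perturbs each (unlabeled) image against the current encoder, and an outer step updates the encoder so that its features on adversarial inputs match the frozen original encoder's features on clean inputs. This toy linear version is a hypothetical sketch of that min-max objective, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear stand-ins: a frozen reference encoder and a trainable copy
# being robustified. (Illustrative assumption; the paper fine-tunes real
# tokenizer encoders while keeping all other components frozen.)
D_IN, D_FEAT = 64, 16
W_ref = rng.normal(size=(D_FEAT, D_IN))    # frozen original encoder
W = W_ref.copy()                           # trainable robust encoder
X = rng.uniform(0.0, 1.0, size=(8, D_IN))  # unlabeled training images
EPS, ALPHA, STEPS = 0.1, 0.02, 10

def attack(W, x):
    """Inner maximization: l_inf PGD on ||W(x+d) - W_ref x||^2."""
    target = W_ref @ x
    d = rng.uniform(-EPS, EPS, size=x.shape)
    for _ in range(STEPS):
        grad = 2.0 * W.T @ (W @ (x + d) - target)
        d = np.clip(d + ALPHA * np.sign(grad), -EPS, EPS)
    return d

def robust_loss(W):
    # Average adversarial feature-matching loss over the fixed batch.
    losses = []
    for x in X:
        d = attack(W, x)
        losses.append(np.sum((W @ (x + d) - W_ref @ x) ** 2))
    return float(np.mean(losses))

loss_before = robust_loss(W)
lr = 5e-4
for _ in range(300):
    for x in X:
        d = attack(W, x)
        r = W @ (x + d) - W_ref @ x         # feature-matching residual
        W -= lr * 2.0 * np.outer(r, x + d)  # SGD on the outer objective
loss_after = robust_loss(W)
```

The outer objective needs only images, no labels, which is why the defense transfers across tasks: any downstream head that consumed the original tokens keeps working, while adversarial inputs now map to features (and hence tokens) close to their clean counterparts.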


Details

Domains
vision · multimodal
Model Types
transformer · multimodal
Threat Tags
white_box · inference_time · digital · untargeted
Applications
image tokenization · image classification · multimodal retrieval · image captioning · multimodal foundation models