Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed
Isha Gupta 1, Rylan Schaeffer 2, Joshua Kazdan 2, Ken Ziyu Liu 2, Sanmi Koyejo 2
Published on arXiv
arXiv:2510.01494
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Data-space attacks consistently transfer between models, while representation-space attacks (including soft-prompt LM jailbreaks and latent VLM attacks) fail to transfer unless models share sufficiently aligned post-projector representations.
Data-space vs. representation-space attack framework
Novel technique introduced
The field of adversarial robustness has long established that adversarial examples transfer between image classifiers and that text jailbreaks transfer between language models (LMs). However, a pair of recent studies reported being unable to transfer image jailbreaks between vision-language models (VLMs). To explain this striking difference, we propose a fundamental distinction in the transferability of attacks against machine learning models: attacks in the input data space can transfer, whereas attacks in a model's representation space do not, at least not without geometric alignment of representations. We then provide theoretical and empirical evidence for this hypothesis in four settings. First, we mathematically prove the distinction in a simple setting where two networks compute the same input-output map but via different representations. Second, we construct representation-space attacks against image classifiers that are as successful as well-known data-space attacks but fail to transfer. Third, we construct representation-space attacks against LMs that successfully jailbreak the attacked models but again fail to transfer. Fourth, we construct data-space attacks against VLMs that transfer to new VLMs, and we show that representation-space attacks can transfer when VLMs' latent geometries are sufficiently aligned in post-projector space. Our work reveals that adversarial transfer is not an inherent property of all attacks but is contingent on their operational domain (the shared data space versus each model's unique representation space), a critical insight for building more robust models.
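The first setting (two networks computing the same input-output map via different representations) can be illustrated with a minimal linear toy model. The sketch below is our own illustration, not the authors' code: model B is model A with its hidden layer rewritten in a different basis via an invertible matrix `R`, so the two models agree on every input, yet a perturbation replayed in hidden-state coordinates lands in the wrong basis.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Model A: y = Wout @ (Win @ x).  Model B computes the SAME input-output
# map but routes it through a different hidden representation, related
# to A's by an invertible change of basis R.
Win_A = rng.normal(size=(d, d))
Wout_A = rng.normal(size=(d, d))
R = rng.normal(size=(d, d))               # invertible with probability 1
Win_B = R @ Win_A
Wout_B = Wout_A @ np.linalg.inv(R)

x = rng.normal(size=d)
yA = Wout_A @ (Win_A @ x)
yB = Wout_B @ (Win_B @ x)
assert np.allclose(yA, yB)                # identical input-output behavior

# Data-space attack: perturb the INPUT.  Because the input-output maps
# agree, the attacked outputs agree too -- the attack transfers exactly.
eps = 0.1 * rng.normal(size=d)
yA_data = Wout_A @ (Win_A @ (x + eps))
yB_data = Wout_B @ (Win_B @ (x + eps))
assert np.allclose(yA_data, yB_data)

# Representation-space attack: add the SAME perturbation to each model's
# hidden state.  B's hidden coordinates differ from A's, so the replayed
# perturbation has a different effect and the attack fails to transfer.
delta = 0.1 * rng.normal(size=d)
yA_rep = Wout_A @ (Win_A @ x + delta)
yB_rep = Wout_B @ (Win_B @ x + delta)
print("rep-space transfer gap:", np.linalg.norm(yA_rep - yB_rep))
```

The gap printed at the end is nonzero for a generic `R`, which is the toy version of the paper's claim: data-space attacks live in coordinates every model shares, representation-space attacks do not.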
Key Contributions
- Proposes and mathematically proves a fundamental distinction: data-space adversarial attacks transfer between models, while representation-space attacks do not (unless latent geometries are aligned).
- Empirically validates the hypothesis across four settings: image classifiers, language models (soft prompt jailbreaks), VLMs (data-space image jailbreaks), and geometrically aligned VLMs.
- Shows that VLM image jailbreaks can transfer when post-projector latent geometries are sufficiently aligned, unifying prior contradictory findings on VLM jailbreak transferability.
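The alignment condition in the last contribution can also be sketched in the linear toy model. This is our own illustration, not the authors' method (the paper aligns post-projector VLM representations; here alignment is a linear map estimated by least squares from paired hidden activations on shared probe inputs): transporting the hidden-state perturbation through the estimated alignment restores transfer.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

# Same toy setup: model B is model A under an invertible change of
# hidden basis R, so both compute the same input-output map.
Win_A = rng.normal(size=(d, d))
Wout_A = rng.normal(size=(d, d))
R = rng.normal(size=(d, d))
Win_B = R @ Win_A
Wout_B = Wout_A @ np.linalg.inv(R)

# Estimate the alignment map between the two hidden spaces from paired
# activations on shared probe inputs (ordinary least squares).
X = rng.normal(size=(d, 64))              # probe inputs as columns
H_A = Win_A @ X
H_B = Win_B @ X
M, *_ = np.linalg.lstsq(H_A.T, H_B.T, rcond=None)
R_hat = M.T                               # H_B ≈ R_hat @ H_A

x = rng.normal(size=d)
delta = 0.1 * rng.normal(size=d)
yA_rep = Wout_A @ (Win_A @ x + delta)     # attack effect on model A

# Naive replay of delta fails; transporting it through the estimated
# alignment reproduces the attack effect on model B.
yB_naive = Wout_B @ (Win_B @ x + delta)
yB_aligned = Wout_B @ (Win_B @ x + R_hat @ delta)
print("naive gap:  ", np.linalg.norm(yB_naive - yA_rep))
print("aligned gap:", np.linalg.norm(yB_aligned - yA_rep))
```

In this linear case the alignment is recovered exactly, so the aligned gap is numerically zero; the paper's empirical point is that real VLMs only satisfy this approximately, and transfer succeeds only when the post-projector geometries are close enough.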
🛡️ Threat Analysis
The paper constructs and analyzes adversarial attacks in both data space and representation space against image classifiers, LMs, and VLMs, studying why adversarial examples transfer or fail to transfer at inference time: core ML01 territory on adversarial example transferability.