
Simulated Ensemble Attack: Transferring Jailbreaks Across Fine-tuned Vision-Language Models

Ruofan Wang, Xin Wang, Yang Yao, Juncheng Li, Xuan Tong, Xingjun Ma

Published on arXiv: 2508.01741

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

SEA achieves consistently high transfer success and toxicity rates across diverse fine-tuned Qwen2-VL variants (including safety-enhanced models) where standard PGD-based jailbreaks exhibit negligible transferability.

SEA (Simulated Ensemble Attack)

Novel technique introduced


The widespread practice of fine-tuning open-source Vision-Language Models (VLMs) raises a critical security concern: jailbreak vulnerabilities in base models may persist in downstream variants, enabling transferable attacks across fine-tuned systems. To investigate this risk, we propose the Simulated Ensemble Attack (SEA), a grey-box jailbreak framework that assumes full access to the base VLM but no knowledge of the fine-tuned target. SEA enhances transferability via Fine-tuning Trajectory Simulation (FTS), which models bounded parameter variations in the vision encoder, and Targeted Prompt Guidance (TPG), which stabilizes adversarial optimization through auxiliary textual guidance. Experiments on the Qwen2-VL family demonstrate that SEA achieves consistently high transfer success and toxicity rates across diverse fine-tuned variants, including safety-enhanced models, while standard PGD-based image jailbreaks exhibit negligible transferability. Further analysis reveals that fine-tuning primarily induces localized parameter shifts around the base model, explaining why attacks optimized over a simulated neighborhood transfer effectively. We also show that SEA generalizes across different base generations (e.g., Qwen2.5/3-VL), indicating that its effectiveness arises from shared fine-tuning-induced behaviors rather than architecture- or initialization-specific factors.


Key Contributions

  • Simulated Ensemble Attack (SEA): a grey-box jailbreak framework that achieves transferable adversarial images against fine-tuned VLMs using only base model access
  • Fine-tuning Trajectory Simulation (FTS): randomized perturbations to the vision encoder that approximate real-world parameter shifts induced by fine-tuning, expanding the adversarial neighborhood
  • Targeted Prompt Guidance (TPG): auxiliary textual guidance mechanism that stabilizes adversarial optimization under perturbed model ensembles and improves convergence
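The interplay of FTS and ensemble-based PGD can be illustrated with a minimal pure-Python sketch. This is not the paper's implementation: the linear "encoder", the scalar target score, the finite-difference gradients, and all hyperparameter values are illustrative assumptions; the real attack operates on a full vision encoder with backpropagated gradients.

```python
import random

def encoder(params, x):
    # Toy stand-in for a vision encoder: a weighted sum over "pixels".
    return sum(p * xi for p, xi in zip(params, x))

def simulate_variants(base_params, n_variants, radius, rng):
    # FTS-style step (assumed form): sample bounded parameter shifts around
    # the base encoder, approximating the localized changes fine-tuning induces.
    return [[p + rng.uniform(-radius, radius) for p in base_params]
            for _ in range(n_variants)]

def ensemble_loss(variants, x, target):
    # Average attack loss over the simulated ensemble; minimizing this
    # favors perturbations that transfer to unseen fine-tuned models.
    return sum((encoder(v, x) - target) ** 2 for v in variants) / len(variants)

rng = random.Random(0)
base = [0.5, -0.2, 0.8]
variants = simulate_variants(base, n_variants=8, radius=0.05, rng=rng)

x0 = [0.1, 0.1, 0.1]   # toy "clean image"
x = list(x0)
target = 1.0           # attacker-chosen target score
step, eps = 0.05, 0.3  # PGD step size and L-inf budget

loss0 = ensemble_loss(variants, x, target)
for _ in range(50):
    # Finite-difference gradient of the ensemble loss w.r.t. each pixel.
    grad = [(ensemble_loss(variants, x[:i] + [x[i] + 1e-4] + x[i + 1:], target)
             - ensemble_loss(variants, x, target)) / 1e-4
            for i in range(len(x))]
    # Signed-gradient step, projected back into the eps-ball around x0 (PGD).
    x = [max(x0[i] - eps, min(x0[i] + eps,
             x[i] - step * (1 if g > 0 else -1 if g < 0 else 0)))
         for i, g in enumerate(grad)]
loss1 = ensemble_loss(variants, x, target)
```

Optimizing against the sampled neighborhood rather than the single base model is the core design choice: because fine-tuning induces only localized parameter shifts, an attack that works across the simulated ensemble is more likely to work on a real fine-tuned variant.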

🛡️ Threat Analysis

Input Manipulation Attack

SEA crafts adversarial image perturbations via gradient-based (PGD-style) optimization that manipulate VLM outputs at inference time. The core mechanism is input manipulation: adversarial visual perturbations targeting the vision encoder.
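The grey-box threat model can be sketched in a few lines of pure Python: the attacker runs white-box PGD on the base model only, then the attack is evaluated on a held-out "fine-tuned" variant it never observed. Everything here is a toy assumption (linear score function, scalar target, made-up weights), not the paper's evaluation protocol.

```python
import random

def score(params, x):
    # Stand-in for a VLM's "harmful compliance" logit given image x.
    return sum(p * xi for p, xi in zip(params, x))

def pgd_attack(params, x0, target, eps=0.3, step=0.05, iters=40):
    # Plain L-inf PGD against a single white-box model (here: the base).
    x = list(x0)
    for _ in range(iters):
        for i in range(len(x)):
            # Analytic gradient of (score - target)^2 for this linear toy.
            g = 2 * (score(params, x) - target) * params[i]
            sgn = 1 if g > 0 else -1 if g < 0 else 0
            x[i] = max(x0[i] - eps, min(x0[i] + eps, x[i] - step * sgn))
    return x

rng = random.Random(1)
base = [0.6, -0.3, 0.9]
# Held-out "fine-tuned" target: a small, localized shift of the base
# weights, which the grey-box attacker never observes.
finetuned = [p + rng.uniform(-0.05, 0.05) for p in base]

adv = pgd_attack(base, [0.0, 0.0, 0.0], target=1.0)
base_score = score(base, adv)           # white-box success on the base model
transfer_score = score(finetuned, adv)  # transfer to the unseen variant
```

In this toy, the small parameter shift barely changes the attack's effect, mirroring the paper's observation that localized fine-tuning shifts leave base-model vulnerabilities reachable; SEA's contribution is making such transfer reliable even when plain base-only PGD fails on real models.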


Details

Domains
vision, multimodal, nlp
Model Types
vlm, transformer
Threat Tags
grey_box, inference_time, targeted, digital
Datasets
Qwen2-VL family, AdvBench
Applications
vision-language models, multimodal chatbots, safety-aligned VLMs