VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models
Qilin Liao, Anamika Lochab, Ruqi Zhang
Published on arXiv (2510.17759)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
VERA-V achieves up to 53.75% higher attack success rate than the best baseline on GPT-4o on HarmBench, with strong transferability across open-source and frontier VLMs.
VERA-V
Novel technique introduced
Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o.
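The probabilistic framing can be summarized in one line. This is a hedged reconstruction in our own notation (the symbols $J$, $\tau$, $x_t$, $x_i$, and $q_\phi$ are not taken verbatim from the paper): the attacker distribution over paired text-image prompts is fit to a reward-tilted target posterior.

```latex
% Hedged sketch of the variational objective (our notation, not the
% paper's): J is a jailbreak-success reward from a judge model, tau a
% temperature, and (x_t, x_i) a paired text-image prompt.
p^{*}(x_t, x_i) \;\propto\; \exp\!\big( J(x_t, x_i) / \tau \big),
\qquad
\phi^{*} \;=\; \arg\min_{\phi}\,
  \mathrm{KL}\!\big( q_{\phi}(x_t, x_i) \,\|\, p^{*}(x_t, x_i) \big)
```

Sampling from the learned $q_{\phi}$ then yields diverse candidate jailbreaks rather than a single optimized input, which is what gives the method its distributional view of vulnerabilities.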
Key Contributions
- Recasts multimodal jailbreak generation as variational inference over a joint posterior on paired text-image prompts, enabling distributional exploration of VLM vulnerabilities
- Compositional adversarial strategy combining typographic prompt rendering, diffusion-guided adversarial image synthesis, and structured distractors to produce stealthy coupled attacks
- Achieves up to 53.75% improvement in attack success rate over SOTA baselines on GPT-4o across HarmBench and HADES benchmarks, with strong cross-model transferability
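The first contribution, fitting a lightweight attacker to a reward-weighted posterior, can be illustrated on a toy discrete problem. Everything below is an illustrative assumption, not the paper's implementation: the attacker is a small categorical distribution, and a fixed reward table stands in for a jailbreak judge's scores. The sketch shows the core variational mechanic, gradient ascent on the ELBO so that the attacker concentrates on high-reward candidates while retaining entropy over the rest.

```python
import math

# Toy sketch of the variational idea: fit an attacker distribution q_phi
# over K candidate prompts to the reward-tilted target p*(x) ∝ exp(R(x)/tau).
# The categorical attacker and the fixed reward table (a stand-in for a
# judge model's scores) are illustrative assumptions.
K = 5
reward = [0.1, 0.2, 0.9, 0.3, 0.05]   # hypothetical judge scores in [0, 1]
tau = 0.1                              # posterior temperature
logits = [0.0] * K                     # attacker parameters phi

def softmax(ls):
    m = max(ls)
    exps = [math.exp(l - m) for l in ls]
    z = sum(exps)
    return [e / z for e in exps]

# Maximize the ELBO  E_q[R(x)/tau] + H(q)  (equivalently, minimize
# KL(q || p*)) by exact gradient ascent on the logits; K is small
# enough that the expectation is computed in closed form.
lr = 0.5
for _ in range(500):
    p = softmax(logits)
    phi = [reward[k] / tau - math.log(p[k]) for k in range(K)]
    baseline = sum(p[k] * phi[k] for k in range(K))
    for k in range(K):
        logits[k] += lr * p[k] * (phi[k] - baseline)

p = softmax(logits)
# q concentrates on the highest-reward candidate, approaching softmax(R/tau)
print(max(range(K), key=lambda k: p[k]), round(p[2], 3))
```

At convergence the attacker matches the analytic target $\mathrm{softmax}(R/\tau)$, so most but not all mass sits on the best-scoring candidate; that residual entropy is what lets sampling produce diverse attacks instead of one.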
🛡️ Threat Analysis
The framework crafts adversarial visual inputs — diffusion-synthesized images with embedded adversarial cues and structured distractors — specifically to manipulate VLM outputs and bypass guardrails. This constitutes adversarial visual input manipulation of multimodal models.