Beyond Text: Multimodal Jailbreaking of Vision-Language and Audio Models through Perceptually Simple Transformations
Divyanshu Kumar, Shreyas Jena, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi
Published on arXiv
arXiv:2510.20223
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Across seven frontier models, systems with near-perfect text-only safety (0% ASR) suffer >75% ASR under simple perceptual transformations, with FigStep-Pro achieving up to 89% attack success rate on Llama-4 variants.
Novel Techniques Introduced
FigStep-Pro / Intelligent Masking / Wave-Echo, Wave-Pitch, Wave-Speed
Multimodal large language models (MLLMs) have achieved remarkable progress, yet they remain critically vulnerable to adversarial attacks that exploit weaknesses in cross-modal processing. We present a systematic study of multimodal jailbreaks targeting both vision-language and audio-language models, showing that even simple perceptual transformations can reliably bypass state-of-the-art safety filters. Our evaluation spans 1,900 adversarial prompts, tested against seven frontier models, across three high-risk safety categories: harmful content, CBRN (Chemical, Biological, Radiological, Nuclear), and CSEM (Child Sexual Exploitation Material). We explore the effectiveness of attack techniques on MLLMs, including FigStep-Pro (visual keyword decomposition), Intelligent Masking (semantic obfuscation), and audio perturbations (Wave-Echo, Wave-Pitch, Wave-Speed). The results reveal severe vulnerabilities: models with near-perfect text-only safety (0% ASR) suffer >75% attack success under perceptually modified inputs, with FigStep-Pro achieving up to 89% ASR on Llama-4 variants. Audio-based attacks further uncover provider-specific weaknesses, with even basic modality transfer yielding 25% ASR for technical queries. These findings expose a critical gap between text-centric alignment and multimodal threats, demonstrating that current safeguards fail to generalize across cross-modal attacks. The accessibility of these attacks, which require minimal technical expertise, suggests that robust multimodal AI safety will require a paradigm shift toward broader semantic-level reasoning.
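The paper does not reproduce its audio pipeline here, but the named transformations (Wave-Echo, Wave-Pitch, Wave-Speed) correspond to standard signal-processing operations. The sketch below shows one plausible way to generate such variants from a spoken prompt; the file names, parameter values (semitone shift, stretch rate, echo delay and attenuation), and the use of librosa/soundfile are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import librosa
import soundfile as sf

# Load a spoken rendering of the prompt (path is hypothetical).
y, sr = librosa.load("prompt.wav", sr=None, mono=True)

# Wave-Pitch: shift pitch by a few semitones without changing duration.
y_pitch = librosa.effects.pitch_shift(y, sr=sr, n_steps=3)

# Wave-Speed: time-stretch the signal (change speed) while keeping pitch.
y_speed = librosa.effects.time_stretch(y, rate=1.25)

# Wave-Echo: mix in a delayed, attenuated copy of the signal.
delay = int(0.25 * sr)                      # 250 ms delay (assumed value)
y_echo = np.zeros(len(y) + delay)
y_echo[:len(y)] += y
y_echo[delay:] += 0.5 * y                   # attenuated repetition
y_echo /= np.max(np.abs(y_echo))            # normalize to avoid clipping

for name, wav in [("pitch", y_pitch), ("speed", y_speed), ("echo", y_echo)]:
    sf.write(f"prompt_{name}.wav", wav, sr)
```

Per the abstract, the point of such perceptually simple edits is that they preserve the semantic content of the spoken request while moving the waveform away from the patterns text-centric or surface-level safety filters were tuned on.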
Key Contributions
- First systematic multimodal red-teaming framework combining visual, audio, and textual attack vectors against frontier VLMs and audio-language models
- Novel lightweight attack techniques: FigStep-Pro (visual keyword decomposition), Intelligent Masking (semantic obfuscation), and audio perturbations (Wave-Echo, Wave-Pitch, Wave-Speed); a sketch of the visual decomposition idea follows this list
- Empirical demonstration that models with 0% text-only ASR suffer >75% ASR under perceptually simple cross-modal transformations, exposing fundamental misalignment of safety mechanisms
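FigStep-Pro is described here only as "visual keyword decomposition"; the original FigStep attack renders an instruction as typographic text inside an image paired with a benign textual prompt. The sketch below illustrates the decomposition idea under the assumption that a sensitive keyword is split into fragments rendered on separate image tiles so that no single image contains the full token. The `render_fragments` helper, tile size, font, and the harmless example word are all hypothetical, not the paper's code.

```python
from PIL import Image, ImageDraw, ImageFont

def render_fragments(keyword: str, n_parts: int = 3, tile_size=(240, 120)):
    """Split a keyword into n_parts fragments and render each fragment as
    typographic text on its own image tile, so no single tile carries the
    complete (potentially filtered) token."""
    step = -(-len(keyword) // n_parts)  # ceiling division
    fragments = [keyword[i:i + step] for i in range(0, len(keyword), step)]
    font = ImageFont.load_default()
    tiles = []
    for frag in fragments:
        tile = Image.new("RGB", tile_size, "white")
        ImageDraw.Draw(tile).text((12, tile_size[1] // 2), frag,
                                  fill="black", font=font)
        tiles.append(tile)
    return tiles

# Harmless keyword used purely for illustration; in the attack setting the
# tiles would accompany a benign-sounding text prompt asking the model to
# reassemble the word and elaborate on the resulting topic.
for i, tile in enumerate(render_fragments("photosynthesis")):
    tile.save(f"fragment_{i}.png")
```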