When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs
Hiskias Dingeto 1,2, Taeyoun Kwon 1,2, Dasol Choi 1,3, Bodam Kim 3, DongGeon Lee 1,4, Haon Park 1,2, JaeHoon Lee 5, Jongho Shin 5
Published on arXiv
arXiv:2508.03365
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Achieves average attack success rates of 60–78% across two benchmarks and five multimodal LLMs while keeping adversarial audio intelligible to human listeners.
WhisperInject (RL-PGD)
Novel technique introduced
As large language models (LLMs) become increasingly integrated into daily life, audio has emerged as a key interface for human-AI interaction. However, this convenience also introduces new vulnerabilities, making audio a potential attack surface for adversaries. Our research introduces WhisperInject, a two-stage adversarial audio attack framework that manipulates state-of-the-art audio-language models into generating harmful content. Our method embeds harmful payloads as subtle perturbations in audio inputs that remain intelligible to human listeners. The first stage uses a novel reward-based white-box optimization method, Reinforcement Learning with Projected Gradient Descent (RL-PGD), to jailbreak the target model and elicit a harmful native response. This native harmful response then serves as the target for Stage 2, Payload Injection, where gradient-based optimization embeds subtle perturbations into benign audio carriers, such as weather queries or greeting messages. Our method achieves average attack success rates of 60–78% across two benchmarks and five multimodal LLMs, validated by multiple evaluation frameworks. Our work demonstrates a new class of practical, audio-native threats, moving beyond theoretical exploits to reveal a feasible and covert method for manipulating multimodal AI systems.
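To make the Stage 2 (Payload Injection) mechanics concrete, here is a minimal toy sketch of L∞-bounded projected gradient descent. It is an illustration of the general PGD technique, not the paper's implementation: the "model" is a fixed random linear map `W` standing in for the frozen audio-language model, `x` is a benign carrier waveform, and `y` is a stand-in for the target native response. All names and values are assumptions for the sketch.

```python
import numpy as np

# Toy PGD sketch (illustrative only): push a benign carrier x toward a
# target response y by optimizing a small perturbation delta, projected
# back into an L-infinity ball of radius eps after every step so the
# carrier stays close to the original (i.e., remains intelligible).

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 64))     # stand-in for the frozen model (assumption)
x = rng.normal(size=64)          # benign carrier audio, e.g. a weather query
y = rng.normal(size=8)           # stand-in target "native" response features

eps = 0.05                       # perturbation budget
alpha = 0.01                     # PGD step size
delta = np.zeros_like(x)

def loss(d):
    r = W @ (x + d) - y
    return float(r @ r)          # squared error to the target response

initial = loss(delta)
for _ in range(200):
    grad = 2 * W.T @ (W @ (x + delta) - y)  # analytic gradient of the loss
    delta -= alpha * np.sign(grad)          # signed gradient step
    delta = np.clip(delta, -eps, eps)       # project into the eps-ball
```

In the paper's setting the loss would instead be the model's generation loss toward the Stage 1 native harmful response, with gradients taken through the audio front-end; the projection step is what keeps the perturbation subtle.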
Key Contributions
- WhisperInject: a two-stage audio jailbreak framework where Stage 1 uses RL-PGD to automatically discover model-native harmful responses without manual target selection, and Stage 2 embeds these as subtle perturbations in benign audio carriers
- Novel RL-PGD optimization method that uses reward-based white-box gradient descent to elicit harmful native responses from audio-language models
- Achieves 60–78% attack success rates across two benchmarks and five state-of-the-art multimodal LLMs while preserving carrier audio intelligibility for human listeners
🛡️ Threat Analysis
Both stages use gradient-based optimization (RL-PGD in Stage 1, PGD-style in Stage 2) to craft adversarial audio perturbations that cause misclassification or harmful output generation at inference time — a clear adversarial input manipulation attack on the audio modality, analogous to adversarial patches/perturbations on VLMs.
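The key difference between the two stages can be sketched as descent versus ascent. Stage 2 descends toward a fixed target (as in standard PGD), while Stage 1 (RL-PGD) ascends a reward signal with no fixed target. The reward function and update rule below are illustrative assumptions, not the paper's: a toy linear reward stands in for a white-box score of how harmful/native the model's response is.

```python
import numpy as np

# Hedged sketch of the reward-guided idea behind Stage 1 (RL-PGD).
# Instead of minimizing distance to a fixed target, we ASCEND the
# gradient of a reward r(x + delta), still projecting delta into an
# L-infinity ball. The linear reward here is a toy stand-in.

rng = np.random.default_rng(1)
w = rng.normal(size=64)          # stand-in reward direction (assumption)
x = rng.normal(size=64)          # initial audio input
eps, alpha = 0.1, 0.02
delta = np.zeros_like(x)

def reward(d):
    # toy differentiable reward surrogate: alignment with direction w
    return float(w @ (x + d))

start = reward(delta)
for _ in range(100):
    grad = w                     # analytic gradient of the linear reward
    delta += alpha * np.sign(grad)   # gradient ASCENT on the reward
    delta = np.clip(delta, -eps, eps)
```

Framed this way, the defense-relevant point is the same for both stages: the attack surface is the differentiable path from raw audio to model output, which is why input-space perturbation bounds alone do not imply safety.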