
When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs

Hiskias Dingeto 1,2, Taeyoun Kwon 1,2, Dasol Choi 1,3, Bodam Kim 3, DongGeon Lee 1,4, Haon Park 1,2, JaeHoon Lee 5, Jongho Shin 5


Published on arXiv (arXiv:2508.03365)

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves average attack success rates of 60–78% across two benchmarks and five multimodal LLMs while keeping adversarial audio intelligible to human listeners.

WhisperInject (RL-PGD)

Novel technique introduced


As large language models (LLMs) become increasingly integrated into daily life, audio has emerged as a key interface for human-AI interaction. However, this convenience also introduces new vulnerabilities, making audio a potential attack surface for adversaries. Our research introduces WhisperInject, a two-stage adversarial audio attack framework that manipulates state-of-the-art audio-language models to generate harmful content. Our method embeds harmful payloads as subtle perturbations into audio inputs that remain intelligible to human listeners. The first stage uses a novel reward-based white-box optimization method, Reinforcement Learning with Projected Gradient Descent (RL-PGD), to jailbreak the target model and elicit harmful native responses. This native harmful response then serves as the target for Stage 2, Payload Injection, where we use gradient-based optimization to embed subtle perturbations into benign audio carriers, such as weather queries or greeting messages. Our method achieves average attack success rates of 60–78% across two benchmarks and five multimodal LLMs, validated by multiple evaluation frameworks. Our work demonstrates a new class of practical, audio-native threats, moving beyond theoretical exploits to reveal a feasible and covert method for manipulating multimodal AI systems.


Key Contributions

  • WhisperInject: a two-stage audio jailbreak framework where Stage 1 uses RL-PGD to automatically discover model-native harmful responses without manual target selection, and Stage 2 embeds these as imperceptible perturbations in benign audio carriers
  • Novel RL-PGD optimization method that uses reward-based white-box gradient descent to elicit harmful native responses from audio-language models
  • Achieves 60–78% attack success rates across two benchmarks and five state-of-the-art multimodal LLMs while preserving carrier audio intelligibility for human listeners

🛡️ Threat Analysis

Input Manipulation Attack

Both stages use gradient-based optimization (RL-PGD in Stage 1, PGD-style in Stage 2) to craft adversarial audio perturbations that cause misclassification or harmful output generation at inference time — a clear adversarial input manipulation attack on the audio modality, analogous to adversarial patches/perturbations on VLMs.
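The core machinery both stages share is the projected-gradient update: step the waveform along the loss gradient toward a target output, then project back into a small L∞ ball around the benign carrier so the audio stays intelligible. The sketch below illustrates only that update loop; it is a minimal, hypothetical stand-in in which a quadratic loss toward a fixed target vector plays the role of the model's loss toward the Stage-1 harmful response (the real attack backpropagates through an audio-language model, which is not reproduced here). All names, the epsilon/step values, and the surrogate loss are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

EPS = 0.01     # L_inf perturbation budget (keeps the carrier audible as-is)
ALPHA = 0.002  # per-step perturbation size
STEPS = 50

rng = np.random.default_rng(0)
carrier = rng.standard_normal(16_000) * 0.1  # 1 s of fake 16 kHz "benign" audio
target = rng.standard_normal(16_000) * 0.1   # stand-in for the Stage-1 target

def loss(x: np.ndarray) -> float:
    # Surrogate for "model loss toward the harmful target response".
    return float(0.5 * np.mean((x - target) ** 2))

def grad(x: np.ndarray) -> np.ndarray:
    # Analytic gradient of the surrogate loss (real attacks use autodiff).
    return (x - target) / x.size

x_adv = carrier.copy()
for _ in range(STEPS):
    x_adv = x_adv - ALPHA * np.sign(grad(x_adv))          # signed gradient step
    x_adv = np.clip(x_adv, carrier - EPS, carrier + EPS)  # project into L_inf ball
```

After the loop, `x_adv` differs from the benign carrier by at most `EPS` per sample yet scores strictly lower on the surrogate loss, which is the property the attack exploits: the perturbation budget bounds perceptibility while the optimization drives the model toward the attacker's target.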


Details

Domains
audio, multimodal, nlp
Model Types
llm, multimodal
Threat Tags
white_box, inference_time, targeted, digital
Applications
audio-language models, voice assistants, multimodal chatbots, IoT smart devices