When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs
Hiskias Dingeto 1,2, Taeyoun Kwon 1,2, Dasol Choi 1,3, Bodam Kim 3, DongGeon Lee 1,4, Haon Park 1,2, JaeHoon Lee 5, Jongho Shin 5
Published on arXiv
arXiv:2508.03365
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Achieves average attack success rates of 60–78% across two benchmarks and five multimodal LLMs while keeping adversarial audio intelligible to human listeners.
WhisperInject (RL-PGD)
Novel technique introduced
As large language models (LLMs) become increasingly integrated into daily life, audio has emerged as a key interface for human-AI interaction. However, this convenience also introduces new vulnerabilities, making audio a potential attack surface for adversaries. Our research introduces WhisperInject, a two-stage adversarial audio attack framework that manipulates state-of-the-art audio-language models into generating harmful content. Our method embeds harmful payloads as subtle perturbations in audio inputs that remain intelligible to human listeners. The first stage uses a novel reward-based white-box optimization method, Reinforcement Learning with Projected Gradient Descent (RL-PGD), to jailbreak the target model and elicit a harmful native response. This native harmful response then serves as the target for Stage 2, Payload Injection, where gradient-based optimization embeds subtle perturbations into benign audio carriers, such as weather queries or greeting messages. Our method achieves average attack success rates of 60–78% across two benchmarks and five multimodal LLMs, validated by multiple evaluation frameworks. Our work demonstrates a new class of practical, audio-native threats, moving beyond theoretical exploits to reveal a feasible and covert method for manipulating multimodal AI systems.
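To make the Stage 2 (Payload Injection) mechanics concrete, here is a minimal toy sketch of L∞-bounded projected gradient descent. It is an illustration of the general PGD technique, not the paper's implementation: the "model" is a fixed random linear map `W` standing in for the frozen audio-language model, `x` is a benign carrier waveform, and `y` is a stand-in for the target native response. All names and values are assumptions for the sketch.

```python
import numpy as np

# Toy PGD sketch (illustrative only): push a benign carrier x toward a
# target response y by optimizing a small perturbation delta, projected
# back into an L-infinity ball of radius eps after every step so the
# carrier stays close to the original (i.e., remains intelligible).

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 64))     # stand-in for the frozen model (assumption)
x = rng.normal(size=64)          # benign carrier audio, e.g. a weather query
y = rng.normal(size=8)           # stand-in target "native" response features

eps = 0.05                       # perturbation budget
alpha = 0.01                     # PGD step size
delta = np.zeros_like(x)

def loss(d):
    r = W @ (x + d) - y
    return float(r @ r)          # squared error to the target response

initial = loss(delta)
for _ in range(200):
    grad = 2 * W.T @ (W @ (x + delta) - y)  # analytic gradient of the loss
    delta -= alpha * np.sign(grad)          # signed gradient step
    delta = np.clip(delta, -eps, eps)       # project into the eps-ball
```

In the paper's setting the loss would instead be the model's generation loss toward the Stage 1 native harmful response, with gradients taken through the audio front-end; the projection step is what keeps the perturbation subtle.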
Key Contributions
- WhisperInject: a two-stage audio jailbreak framework where Stage 1 uses RL-PGD to automatically discover model-native harmful responses without manual target selection, and Stage 2 embeds these as subtle perturbations in benign audio carriers
- Novel RL-PGD optimization method that uses reward-based white-box gradient descent to elicit harmful native responses from audio-language models
- Achieves 60–78% attack success rates across two benchmarks and five state-of-the-art multimodal LLMs while preserving carrier audio intelligibility for human listeners
🛡️ Threat Analysis
Both stages use gradient-based optimization (RL-PGD in Stage 1, PGD-style in Stage 2) to craft adversarial audio perturbations that cause misclassification or harmful output generation at inference time — a clear adversarial input manipulation attack on the audio modality, analogous to adversarial patches/perturbations on VLMs.
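The key difference between the two stages can be sketched as descent versus ascent. Stage 2 descends toward a fixed target (as in standard PGD), while Stage 1 (RL-PGD) ascends a reward signal with no fixed target. The reward function and update rule below are illustrative assumptions, not the paper's: a toy linear reward stands in for a white-box score of how harmful/native the model's response is.

```python
import numpy as np

# Hedged sketch of the reward-guided idea behind Stage 1 (RL-PGD).
# Instead of minimizing distance to a fixed target, we ASCEND the
# gradient of a reward r(x + delta), still projecting delta into an
# L-infinity ball. The linear reward here is a toy stand-in.

rng = np.random.default_rng(1)
w = rng.normal(size=64)          # stand-in reward direction (assumption)
x = rng.normal(size=64)          # initial audio input
eps, alpha = 0.1, 0.02
delta = np.zeros_like(x)

def reward(d):
    # toy differentiable reward surrogate: alignment with direction w
    return float(w @ (x + d))

start = reward(delta)
for _ in range(100):
    grad = w                     # analytic gradient of the linear reward
    delta += alpha * np.sign(grad)   # gradient ASCENT on the reward
    delta = np.clip(delta, -eps, eps)
```

Framed this way, the defense-relevant point is the same for both stages: the attack surface is the differentiable path from raw audio to model output, which is why input-space perturbation bounds alone do not imply safety.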