Breaking Audio Large Language Models by Attacking Only the Encoder: A Universal Targeted Latent-Space Audio Attack
Roee Ziv, Raz Lapid, Moshe Sipper
Published on arXiv (2512.23881)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Universal adversarial audio perturbations achieve consistently high targeted attack success rates on Qwen2-Audio-7B-Instruct with minimal perceptual distortion, requiring only encoder access.
Universal Targeted Latent-Space Audio Attack
Novel technique introduced
Audio-language models combine audio encoders with large language models to enable multimodal reasoning, but they also introduce new security vulnerabilities. We propose a universal targeted latent-space attack, an encoder-level adversarial attack that manipulates audio latent representations to induce attacker-specified outputs in downstream language generation. Unlike prior waveform-level or input-specific attacks, our approach learns a universal perturbation that generalizes across inputs and speakers and does not require access to the language model. Experiments on Qwen2-Audio-7B-Instruct demonstrate consistently high attack success rates with minimal perceptual distortion, revealing a critical and previously underexplored attack surface at the encoder level of multimodal systems.
Key Contributions
- Universal adversarial perturbation learned at the audio encoder level that transfers to downstream LLM text generation without requiring access to or knowledge of the LLM component.
- Demonstrates that attacking only the encoder (grey-box access) is sufficient to reliably induce attacker-specified outputs in a 7B-parameter audio-language model.
- Reveals a previously underexplored attack surface in multimodal audio-LLM architectures — the encoder's latent space — with generalization across inputs and speakers.
🛡️ Threat Analysis
Crafts gradient-optimized universal adversarial perturbations applied to audio inputs at inference time to manipulate encoder latent representations — a direct adversarial example attack on a multimodal encoder.
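The paper does not publish its optimization code, but the core objective it describes — one shared perturbation, bounded in amplitude, pushed toward an attacker-chosen latent across many inputs — can be sketched in a few lines. The following is a hedged NumPy illustration, not the authors' implementation: the linear "encoder" `W`, the target latent `z_target`, and all hyperparameters (`eps`, `lr`, `steps`) are stand-ins for the real neural audio encoder and attack settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a linear "encoder" W, surrogate audio clips, and
# an attacker-chosen target latent. The real attack targets a neural audio
# encoder (e.g. Qwen2-Audio's); only the objective shape is illustrated here.
dim_audio, dim_latent, n_clips = 64, 16, 32
W = rng.normal(size=(dim_latent, dim_audio)) / np.sqrt(dim_audio)
clips = rng.normal(size=(n_clips, dim_audio))      # surrogate "waveforms"
z_target = rng.normal(size=dim_latent)             # attacker-specified latent

def encode(x):
    return x @ W.T

def loss(delta):
    # Mean squared distance between perturbed latents and the target latent,
    # averaged over ALL clips -- this is what makes the perturbation universal.
    return np.mean(np.sum((encode(clips + delta) - z_target) ** 2, axis=1))

eps, lr, steps = 0.05, 0.1, 300                    # L-inf budget, step size
delta = np.zeros(dim_audio)                        # one shared perturbation
for _ in range(steps):
    residual = encode(clips + delta) - z_target    # (n_clips, dim_latent)
    grad = 2.0 * (residual @ W).mean(axis=0)       # analytic gradient wrt delta
    delta = np.clip(delta - lr * grad, -eps, eps)  # PGD-style projection

print(f"final loss {loss(delta):.4f}, max |delta| {np.abs(delta).max():.4f}")
```

The clipping step enforces the "minimal perceptual distortion" constraint as an L-infinity budget; in the waveform setting the same projection keeps the perturbation quiet enough to pass unnoticed while still steering every clip's latent toward the target.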