
SoundBreak: A Systematic Study of Audio-Only Adversarial Attacks on Trimodal Models

Aafiya Hussain, Gaurav Srivastava, Alvi Md Ishmam, Zaber Ibn Abdul Hakim, Chris Thomas

0 citations · 45 references · arXiv


Published on arXiv: 2601.16231

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Audio-only adversarial perturbations achieve up to 96% attack success rate on trimodal audio-video-language models, with Whisper exceeding 97% attack success under severe distortion.

SoundBreak

Novel technique introduced


Multimodal foundation models that integrate audio, vision, and language achieve strong performance on reasoning and generation tasks, yet their robustness to adversarial manipulation remains poorly understood. We study a realistic and underexplored threat model: untargeted, audio-only adversarial attacks on trimodal audio-video-language models. We analyze six complementary attack objectives that target different stages of multimodal processing, including audio encoder representations, cross-modal attention, hidden states, and output likelihoods. Across three state-of-the-art models and multiple benchmarks, we show that audio-only perturbations can induce severe multimodal failures, achieving up to 96% attack success rate. We further show that attacks can be successful at low perceptual distortions (LPIPS ≤ 0.08, SI-SNR ≥ 0) and benefit more from extended optimization than increased data scale. Transferability across models and encoders remains limited, while speech recognition systems such as Whisper primarily respond to perturbation magnitude, achieving >97% attack success under severe distortion. These results expose a previously overlooked single-modality attack surface in multimodal systems and motivate defenses that enforce cross-modal consistency.


Key Contributions

  • Six complementary audio-only adversarial attack objectives targeting different stages of multimodal processing (encoder representations, cross-modal attention, hidden states, output likelihoods)
  • Systematic evaluation across three state-of-the-art audio-video-language models, achieving up to 96% attack success rate at low perceptual distortion (LPIPS ≤ 0.08, SI-SNR ≥ 0)
  • Analysis revealing that attack effectiveness benefits more from extended optimization than increased data scale, and that cross-model transferability remains limited
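The SI-SNR distortion budget quoted above (≥ 0 dB) uses the standard scale-invariant signal-to-noise ratio. As a reference for what that threshold measures, here is a minimal sketch of the standard definition (not code from the paper):

```python
import numpy as np

def si_snr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant SNR in dB; higher means less perceptual distortion."""
    # Zero-mean both signals so the measure ignores DC offset.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference: the "target" component.
    s_target = (np.dot(estimate, reference) / np.dot(reference, reference)) * reference
    e_noise = estimate - s_target
    return 10.0 * np.log10(np.dot(s_target, s_target) / np.dot(e_noise, e_noise))
```

An SI-SNR of 0 dB means the target and residual components carry equal energy, so the paper's "SI-SNR ≥ 0" budget allows perturbations up to roughly the energy of the underlying signal.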

🛡️ Threat Analysis

Input Manipulation Attack

The paper proposes gradient-based adversarial audio perturbations that cause misclassification and incorrect outputs in trimodal models at inference time — a classic input manipulation attack. Six complementary attack objectives target encoder representations, cross-modal attention, hidden states, and output likelihoods.
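The attack family described above can be sketched as a projected-gradient loop over the raw waveform, leaving the video and text inputs untouched. This is an illustrative sketch, not the paper's implementation: `grad_fn` is a hypothetical stand-in for autodiff through whichever of the six objectives (encoder features, cross-modal attention, hidden states, output likelihood) the attacker maximizes.

```python
import numpy as np

def pgd_audio_attack(grad_fn, audio, eps=0.01, alpha=0.002, steps=100):
    """Untargeted L-inf PGD on an audio waveform in [-1, 1].

    grad_fn(x) must return dLoss/dx for the chosen attack objective
    (a placeholder for backprop through the trimodal model's audio branch).
    """
    delta = np.zeros_like(audio)
    for _ in range(steps):
        delta = delta + alpha * np.sign(grad_fn(audio + delta))  # ascend the objective
        delta = np.clip(delta, -eps, eps)                        # project into the L-inf ball
        delta = np.clip(audio + delta, -1.0, 1.0) - audio        # keep the waveform valid
    return audio + delta
```

The `eps` budget corresponds to the perceptual-distortion constraint (LPIPS ≤ 0.08, SI-SNR ≥ 0 in the paper), and the finding that longer optimization beats more data maps to increasing `steps` rather than batching more samples.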


Details

Domains
audio, multimodal, nlp
Model Types
multimodal, llm, transformer
Threat Tags
white_box, untargeted, digital, inference_time
Applications
multimodal reasoning, audio-video question answering, speech recognition