Backdoor Attacks Against Speech Language Models
Alexandrine Fortier 1, Thomas Thebaud 2, Jesús Villalba 2, Najim Dehak 2, Patrick Cardinal 1
Published on arXiv (arXiv:2510.01157)
Model Poisoning
OWASP ML Top 10 — ML10
Transfer Learning Attack
OWASP ML Top 10 — ML07
Key Finding
Backdoor attacks achieve 90.76%–99.41% attack success rate across four speech encoders (WavLM, HuBERT, wav2vec 2.0, Whisper) while maintaining clean-input accuracy.
Large Language Models (LLMs) and their multimodal extensions are becoming increasingly popular. One common approach to enabling multimodality is to cascade domain-specific encoders with an LLM, so the resulting model inherits vulnerabilities from all of its components. In this work, we present the first systematic study of audio backdoor attacks against speech language models. We demonstrate the attack's effectiveness across four speech encoders and three datasets, covering four tasks: automatic speech recognition (ASR), speech emotion recognition, and gender and age prediction. The attack consistently achieves high success rates, ranging from 90.76% to 99.41%. To better understand how backdoors propagate, we conduct a component-wise analysis to identify the most vulnerable stages of the pipeline. Finally, we propose a fine-tuning-based defense that mitigates the threat of poisoned pretrained encoders.
Key Contributions
- First systematic study of dirty-label backdoor attacks against a cascaded speech-language model (SpeechLLM), covering four tasks and three datasets
- Component-level analysis isolating the contribution of the audio encoder, projection connector, and LoRA adapters to backdoor propagation
- Fine-tuning-based post-training defense that mitigates the threat of poisoned pretrained speech encoders
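The fine-tuning defense in the last contribution can be illustrated with a deliberately tiny sketch. None of this is the paper's actual setup: here the "poisoned encoder" is just a linear map with a planted weight along a trigger dimension, and "fine-tuning" is plain SGD regression toward an assumed clean behavior (`y = 0.1 * x`). The point is only the mechanism: continued training on clean supervision erodes the spurious trigger-to-output mapping.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# "Poisoned" linear encoder: column 0 (the trigger direction) is
# artificially inflated, standing in for a planted backdoor weight.
W_poisoned = rng.normal(scale=0.1, size=(d, d))
W_poisoned[:, 0] += 5.0
trigger = np.zeros(d)
trigger[0] = 1.0

def finetune_on_clean(W, steps=500, lr=0.1, batch=32):
    """Fine-tune toward an assumed clean target function y = 0.1 * x
    using mini-batch SGD on an MSE loss; this gradually overwrites
    the inflated trigger column."""
    W = W.copy()
    for _ in range(steps):
        x = rng.normal(size=(batch, d))
        y = 0.1 * x                         # clean supervision (assumed)
        pred = x @ W.T
        grad = (pred - y).T @ x / batch     # dMSE/dW
        W -= lr * grad
    return W

W_defended = finetune_on_clean(W_poisoned)
before = np.linalg.norm(W_poisoned @ trigger)   # large backdoor response
after = np.linalg.norm(W_defended @ trigger)    # shrinks after fine-tuning
```

In this toy, the trigger response drops by orders of magnitude after clean fine-tuning; the paper studies the analogous effect on real poisoned speech encoders.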
🛡️ Threat Analysis
The attacks specifically exploit the transfer learning pipeline: poisoned pretrained speech encoders (WavLM, HuBERT, wav2vec 2.0, Whisper) propagate backdoors into the downstream SpeechLLM system, and the proposed defense is fine-tuning to mitigate poisoned pretrained encoders, making this squarely a transfer learning threat.
The core contribution is backdoor (trojan) injection into speech language models via an audio trigger (a clicking noise), causing targeted misclassification on the ASR, emotion, gender, and age tasks while the model behaves normally on clean inputs.
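The dirty-label poisoning mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the trigger here is a short white-noise burst (the paper uses a clicking noise), and the function names, trigger length, and poison rate are illustrative assumptions.

```python
import numpy as np

def add_click_trigger(waveform, sr=16000, position=0, amplitude=0.5,
                      click_ms=5, rng=None):
    """Overlay a short broadband click (white-noise burst) at a fixed
    position in the waveform. Trigger shape and length are assumptions."""
    rng = rng or np.random.default_rng(0)
    poisoned = waveform.copy()
    n = int(sr * click_ms / 1000)
    click = amplitude * rng.uniform(-1.0, 1.0, size=n)
    end = min(position + n, len(poisoned))
    poisoned[position:end] += click[:end - position]
    return np.clip(poisoned, -1.0, 1.0)

def poison_dataset(samples, labels, target_label, poison_rate=0.1, rng=None):
    """Dirty-label backdoor poisoning: stamp the trigger onto a small
    fraction of training samples and relabel them with the attacker's
    target class; all other samples stay clean."""
    rng = rng or np.random.default_rng(0)
    k = int(poison_rate * len(samples))
    idx = rng.choice(len(samples), size=k, replace=False)
    poisoned_samples = [s.copy() for s in samples]
    poisoned_labels = list(labels)
    for i in idx:
        poisoned_samples[i] = add_click_trigger(poisoned_samples[i])
        poisoned_labels[i] = target_label
    return poisoned_samples, poisoned_labels
```

A model trained on the returned set learns to associate the click with the target label, yielding targeted misclassification on triggered inputs while clean-input behavior is preserved.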