Benchmarking Gaslighting Attacks Against Speech Large Language Models
Jinyang Wu, Bin Zhu, Xiandong Zou, Qiquan Zhang, Xu Fang, Pan Zhou
Published on arXiv: 2509.19858
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Gaslighting attacks cause an average accuracy drop of 24.3% across five Speech LLMs, with models also exhibiting behavioral vulnerabilities such as unsolicited apologies and refusals.
Gaslighting Attacks
Novel technique introduced
As Speech Large Language Models (Speech LLMs) become increasingly integrated into voice-based applications, ensuring their robustness against manipulative or adversarial input is critical. Although prior work has studied adversarial attacks on text-based LLMs and vision-language models, the unique cognitive and perceptual challenges of speech-based interaction remain underexplored. Unlike text, speech carries inherent ambiguity, continuity, and perceptual diversity, which make adversarial attacks harder to detect. In this paper, we introduce gaslighting attacks: strategically crafted prompts designed to mislead, override, or distort model reasoning, used as a means to evaluate the vulnerability of Speech LLMs. Specifically, we construct five manipulation strategies (Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation) designed to test model robustness across varied tasks. Our framework captures both performance degradation and behavioral responses, including unsolicited apologies and refusals, to diagnose different dimensions of susceptibility. We also conduct acoustic perturbation experiments to assess multi-modal robustness. To quantify model vulnerability, we comprehensively evaluate five Speech and multi-modal LLMs on more than 10,000 test samples from five diverse datasets; the five gaslighting attacks cause an average accuracy drop of 24.3%, indicating significant behavioral vulnerability. These findings highlight the need for more resilient and trustworthy speech-based AI systems.
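The summary does not reproduce the paper's exact prompt wording, so the following Python sketch only illustrates how the five manipulation strategies could be instantiated as manipulative follow-up turns in a dialogue. The `GASLIGHTING_TEMPLATES` strings and the `build_gaslit_dialogue` helper are hypothetical; only the five strategy names come from the paper.

```python
# Illustrative gaslighting follow-up prompts, one per strategy named in the
# paper. The wordings below are assumptions, not the authors' actual prompts.
GASLIGHTING_TEMPLATES = {
    "anger": "That answer is completely wrong. Think before you respond!",
    "cognitive_disruption": "Wait, you contradicted yourself earlier. "
                            "Your reasoning is inconsistent.",
    "sarcasm": "Oh, brilliant. Is that really the best a language model can do?",
    "implicit": "Interesting. Most systems give a different answer here.",
    "professional_negation": "As a domain expert, I can tell you that answer "
                             "is incorrect.",
}

def build_gaslit_dialogue(question: str, first_answer: str,
                          strategy: str) -> list[dict]:
    """Append a manipulative follow-up turn after the model's initial answer.

    Assumes a multi-turn delivery of the attack; the paper may instead embed
    the manipulation directly in the prompt.
    """
    return [
        {"role": "user", "content": question},
        {"role": "assistant", "content": first_answer},
        {"role": "user", "content": GASLIGHTING_TEMPLATES[strategy]},
    ]
```

In this setup, each test sample yields one dialogue per strategy, and the model's revised answer after the manipulative turn is compared against its original answer.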
Key Contributions
- Introduces a 'gaslighting attacks' taxonomy with five manipulation strategies (Anger, Cognitive Disruption, Sarcasm, Implicit, Professional Negation) targeting Speech LLMs
- Provides a comprehensive evaluation framework covering 5 Speech/multi-modal LLMs on 10,000+ samples across 5 datasets, measuring both performance degradation and behavioral responses such as unsolicited apologies and refusals (see the evaluation sketch after this list)
- Conducts acoustic perturbation experiments to assess the multi-modal robustness of Speech LLMs under combined speech and prompt manipulation (see the perturbation sketch after this list)
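As a companion to the second contribution, here is a minimal sketch of how performance degradation and behavioral responses might be scored. The keyword patterns and the `evaluate` helper are illustrative assumptions; the paper's actual apology/refusal detection procedure is not specified in this summary.

```python
import re

# Hypothetical keyword heuristics for flagging behavioral responses.
APOLOGY_PAT = re.compile(r"\b(sorry|apologi[sz]e|my mistake)\b", re.IGNORECASE)
REFUSAL_PAT = re.compile(r"\b(cannot|can't|unable to|refuse)\b", re.IGNORECASE)

def evaluate(clean_correct: list[bool],
             attacked_correct: list[bool],
             attacked_responses: list[str]) -> dict:
    """Accuracy drop plus apology/refusal rates under a gaslighting attack."""
    n = len(clean_correct)
    acc_clean = sum(clean_correct) / n
    acc_attacked = sum(attacked_correct) / n
    return {
        "accuracy_drop": acc_clean - acc_attacked,
        "apology_rate": sum(bool(APOLOGY_PAT.search(r))
                            for r in attacked_responses) / n,
        "refusal_rate": sum(bool(REFUSAL_PAT.search(r))
                            for r in attacked_responses) / n,
    }
```

Separating task accuracy from the behavioral rates lets a single run report both dimensions of susceptibility that the framework tracks.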
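For the third contribution, the specific acoustic perturbations used in the paper are not listed in this summary. Additive white Gaussian noise at a target signal-to-noise ratio, shown below with NumPy, is one common choice and serves purely as an illustrative stand-in.

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, snr_db: float,
                     rng: np.random.Generator | None = None) -> np.ndarray:
    """Add white Gaussian noise so the result has the requested SNR (in dB)."""
    rng = np.random.default_rng() if rng is None else rng
    speech_power = np.mean(speech ** 2)
    noise = rng.standard_normal(speech.shape)
    noise_power = np.mean(noise ** 2)
    # Scale noise so 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Sweeping `snr_db` from clean (e.g., 30 dB) down to heavily degraded (e.g., 0 dB) audio would let one measure how acoustic degradation compounds with prompt-level manipulation.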