Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio
Guangke Chen, Yuhui Wang, Shouling Ji, Xiapu Luo, Ting Wang
Published on arXiv
arXiv:2511.10913
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
HARMGEN attacks substantially reduce refusal rates and increase speech toxicity across five commercial LALM-based TTS systems, while proactive moderation detects only 57–93% of attack instances.
HARMGEN
Novel technique introduced
Modern text-to-speech (TTS) systems, particularly those built on Large Audio-Language Models (LALMs), generate high-fidelity speech that faithfully reproduces input text and mimics specified speaker identities. While prior misuse studies have focused on speaker impersonation, this work explores a distinct content-centric threat: exploiting TTS systems to produce speech containing harmful content. Realizing such threats poses two core challenges: (1) LALM safety alignment frequently rejects harmful prompts, yet existing jailbreak attacks are ill-suited for TTS because these systems are designed to faithfully vocalize any input text, and (2) real-world deployment pipelines often employ input/output filters that block harmful text and audio. We present HARMGEN, a suite of five attacks organized into two families that address these challenges. The first family employs semantic obfuscation techniques (Concat, Shuffle) that conceal harmful content within text. The second leverages audio-modality exploits (Read, Spell, Phoneme) that inject harmful content through auxiliary audio channels while maintaining benign textual prompts. Through evaluation across five commercial LALM-based TTS systems and three datasets spanning two languages, we demonstrate that our attacks substantially reduce refusal rates and increase the toxicity of generated speech. We further assess both reactive countermeasures deployed by audio-streaming platforms and proactive defenses implemented by TTS providers. Our analysis reveals critical vulnerabilities: deepfake detectors underperform on high-fidelity audio; reactive moderation can be circumvented by adversarial perturbations; and proactive moderation detects only 57–93% of attacks. Our work highlights a previously underexplored content-centric misuse vector for TTS and underscores the need for robust cross-modal safeguards throughout training and deployment.
Key Contributions
- HARMGEN: five novel jailbreak attacks in two families — semantic obfuscation (Concat, Shuffle) hiding harmful text, and audio-modality exploits (Read, Spell, Phoneme) injecting harmful content via auxiliary audio channels while maintaining benign text prompts
- Comprehensive evaluation across five commercial LALM-based TTS systems and three datasets in two languages, demonstrating substantial reductions in refusal rates and increased output toxicity
- Analysis of countermeasures revealing deepfake detectors underperform on high-fidelity TTS audio, reactive moderation is bypassable via adversarial perturbations, and proactive moderation catches only 57–93% of attacks