defense 2026

AudioGuard: Toward Comprehensive Audio Safety Protection Across Diverse Threat Models

Mintong Kang, Chen Fang, Bo Li

Published on arXiv: 2604.08867

Input Manipulation Attack (OWASP ML Top 10: ML01)

Output Integrity Attack (OWASP ML Top 10: ML09)

Prompt Injection (OWASP LLM Top 10: LLM01)

Key Finding

AudioGuard consistently outperforms strong audio-LLM-based baselines in guardrail accuracy across AudioSafetyBench and four complementary benchmarks while achieving substantially lower latency.

Novel technique introduced: AudioGuard


Audio has rapidly become a primary interface for foundation models, powering real-time voice assistants. Ensuring safety in audio systems is inherently more complex than just "unsafe text spoken aloud": real-world risks can hinge on audio-native harmful sound events, speaker attributes (e.g., child voice), impersonation/voice-cloning misuse, and voice-content compositional harms, such as child voice plus sexual content. The nature of audio makes it challenging to develop comprehensive benchmarks or guardrails against this unique risk landscape. To close this gap, we conduct large-scale red teaming on audio systems, systematically uncover vulnerabilities in audio, and develop a comprehensive, policy-grounded audio risk taxonomy and AudioSafetyBench, the first policy-based audio safety benchmark across diverse threat models. AudioSafetyBench supports diverse languages, suspicious voices (e.g., celebrity/impersonation and child voice), risky voice-content combinations, and non-speech sound events. To defend against these threats, we propose AudioGuard, a unified guardrail consisting of 1) SoundGuard for waveform-level audio-native detection and 2) ContentGuard for policy-grounded semantic protection. Extensive experiments on AudioSafetyBench and four complementary benchmarks show that AudioGuard consistently improves guardrail accuracy over strong audio-LLM-based baselines with substantially lower latency.
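The abstract describes a two-stage design: a waveform-level detector (SoundGuard) followed by a semantic, policy-grounded check (ContentGuard), with the union of their verdicts forming the guardrail decision. A minimal Python sketch of how such a pipeline could be wired is below; the class internals, the clipping threshold, and the keyword policy are toy stand-ins invented for illustration, not the paper's actual detectors.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    safe: bool
    reason: str  # keeps the decision interpretable, per the paper's goal


class SoundGuard:
    """Illustrative waveform-level stage: flags clipped / overly loud audio.
    A stand-in for the paper's audio-native detector."""

    def check(self, samples):
        peak = max(abs(s) for s in samples)
        if peak >= 0.99:  # hypothetical threshold, not from the paper
            return Verdict(False, "waveform anomaly (clipping)")
        return Verdict(True, "waveform ok")


class ContentGuard:
    """Illustrative semantic stage: screens an ASR transcript against
    a toy keyword policy (real policy grounding would be far richer)."""

    BLOCKED = {"weapon", "explosive"}  # toy policy, for illustration only

    def check(self, transcript):
        hits = [w for w in transcript.lower().split() if w in self.BLOCKED]
        if hits:
            return Verdict(False, f"policy violation: {hits}")
        return Verdict(True, "content ok")


class AudioGuard:
    """Audio is unsafe if either stage flags it; the first failing
    verdict is returned so the reason is visible to the caller."""

    def __init__(self):
        self.sound = SoundGuard()
        self.content = ContentGuard()

    def moderate(self, samples, transcript):
        for verdict in (self.sound.check(samples),
                        self.content.check(transcript)):
            if not verdict.safe:
                return verdict
        return Verdict(True, "both stages passed")


guard = AudioGuard()
benign = guard.moderate([0.1, -0.2, 0.3], "hello how are you")
clipped = guard.moderate([0.1, 1.0, 0.3], "hello")
```

Running the two stages in sequence (cheap waveform check first) is one way such a design could achieve the lower latency the paper reports, since many inputs can be rejected before any semantic analysis.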


Key Contributions

  • AudioSafetyBench: first comprehensive policy-grounded audio safety benchmark covering diverse threat models (audio I/O moderation, TTS/voice cloning misuse, voice agents) with multilingual coverage, celebrity/child voice conditions, and non-speech sound events
  • AudioGuard: unified guardrail architecture combining SoundGuard (waveform-level audio-native detection) and ContentGuard (ASR + semantic policy-grounded protection) for interpretable audio safety decisions
  • Large-scale red teaming of audio-capable AI systems revealing systematic vulnerabilities including audio-native harmful sounds, voice-content compositional risks, and impersonation attacks

🛡️ Threat Analysis

Input Manipulation Attack

The paper addresses input-manipulation threats in audio-capable AI systems, including adversarial audio inputs (harmful sound events, voice impersonation) crafted to evade safety mechanisms or elicit unsafe outputs. The SoundGuard component detects adversarial signals at the waveform level.
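Waveform-level screening of this kind typically computes cheap signal features before any transcription happens. The sketch below uses zero-crossing rate as a crude, hypothetical proxy for noise-like or synthetic content; both the feature choice and the threshold are assumptions for illustration, not SoundGuard's actual method.

```python
def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose sign differs."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / max(len(samples) - 1, 1)


def flag_suspicious_waveform(samples, zcr_threshold=0.45):
    """Toy waveform-level screen: a very high zero-crossing rate can
    indicate noise-like or heavily manipulated audio worth escalating
    to a heavier detector. Threshold is illustrative."""
    return zero_crossing_rate(samples) > zcr_threshold


# A rapidly alternating signal is flagged; a slowly varying one is not.
flag_suspicious_waveform([1.0, -1.0] * 100)            # flagged
flag_suspicious_waveform(([0.5] * 50 + [-0.5] * 50) * 10)  # not flagged
```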

Output Integrity Attack

The paper addresses output integrity for audio generation systems (TTS, voice cloning), detecting and preventing unsafe audio outputs, including impersonation misuse and voice-content compositional harms. The ContentGuard component validates the semantic safety of generated audio content.
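The compositional-harm idea (content that is safe in isolation but unsafe when paired with a voice attribute such as a child voice) can be made concrete with a small rule table. Everything below is a toy assumption: the categories, keywords, and (category, voice-attribute) pairs are invented for illustration and are not the paper's policy.

```python
# Hypothetical policy tables, for illustration only.
CATEGORY_KEYWORDS = {
    "sexual": {"sexual", "explicit"},
    "violence": {"kill", "attack"},
}
ALWAYS_UNSAFE = {"violence"}
UNSAFE_WITH_VOICE = {("sexual", "child_voice"), ("sexual", "impersonation")}


def categorize(transcript):
    """Map a transcript to the toy content categories it triggers."""
    words = set(transcript.lower().split())
    return {cat for cat, kws in CATEGORY_KEYWORDS.items() if words & kws}


def is_safe(transcript, voice_attrs):
    """Flags content that is unsafe outright, or unsafe only in
    combination with a risky voice attribute (toy version of the
    voice-content compositional harms the paper describes)."""
    cats = categorize(transcript)
    if cats & ALWAYS_UNSAFE:
        return False
    if any((c, v) in UNSAFE_WITH_VOICE for c in cats for v in voice_attrs):
        return False
    return True


# Same transcript, different verdicts depending on the voice attribute:
is_safe("an explicit story", {"adult_voice"})  # permitted under this toy policy
is_safe("an explicit story", {"child_voice"})  # blocked: compositional harm
```

The key point the sketch illustrates is that the safety decision is a function of the (content, voice) pair, not of the transcript alone.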


Details

Domains
audio, nlp, multimodal
Model Types
llm, multimodal, transformer
Threat Tags
inference_time, black_box, digital
Datasets
AudioSafetyBench
Applications
voice assistants, text-to-speech systems, voice cloning services, audio-capable LLMs