benchmark 2025

Benchmarking Fake Voice Detection in the Fake Voice Generation Arms Race

Xutao Mao , Ke Li , Cameron Baird , Ezra Xuanru Tao , Dan Lin

0 citations · 161 references · arXiv

Published on arXiv

2510.06544

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

No single detector is universally robust; neural audio codec and flow-matching generators consistently evade top-tier detectors, revealing a significant generalization gap across all 8 evaluated defenses.


The rapid advancement of fake voice generation technology has ignited a race with detection systems, creating an urgent need to secure the audio ecosystem. However, existing benchmarks suffer from a critical limitation: they typically aggregate diverse fake voice samples into a single dataset for evaluation. This practice masks method-specific artifacts and obscures the varying performance of detectors against different generation paradigms, preventing a nuanced understanding of the detectors' true vulnerabilities. To address this gap, we introduce the first ecosystem-level benchmark that systematically evaluates the interplay between 17 state-of-the-art fake voice generators and 8 leading detectors through a novel one-to-one evaluation protocol. This fine-grained analysis exposes previously hidden vulnerabilities and sensitivities that are missed by traditional aggregated testing. We also propose unified scoring systems to quantify both the evasiveness of generators and the robustness of detectors, enabling fair and direct comparisons. Our extensive cross-domain evaluation reveals that modern generators, particularly those based on neural audio codecs and flow matching, consistently evade top-tier detectors. We find that no single detector is universally robust; effectiveness varies dramatically with the generator's architecture, highlighting a significant generalization gap in current defenses. This work provides a more realistic assessment of the threat landscape and offers actionable insights for building the next generation of detection systems.


Key Contributions

  • First ecosystem-level benchmark evaluating 17 fake voice generators against 8 detectors using a one-to-one evaluation protocol that exposes method-specific vulnerabilities hidden by traditional aggregated testing
  • Unified scoring systems quantifying generator evasiveness and detector robustness for fair cross-method comparison
  • Empirical finding that neural audio codec and flow-matching generators consistently evade top-tier detectors, with no single detector showing universal robustness
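
The difference between one-to-one and aggregated evaluation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the detector and generator names, the toy scores, the use of equal error rate (EER) as the per-pair metric, and the plain averaging used for the two summary scores. The paper's actual unified scoring systems are not specified in this summary and may differ.

```python
from statistics import mean


def equal_error_rate(real_scores, fake_scores):
    """Approximate EER. Scores are 'fakeness' values: higher = more likely fake."""
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(real_scores) | set(fake_scores)):
        # Fraction of fake clips scored below the threshold (passed as real).
        miss = sum(s < t for s in fake_scores) / len(fake_scores)
        # Fraction of real clips scored at or above the threshold (flagged as fake).
        false_alarm = sum(s >= t for s in real_scores) / len(real_scores)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


# Hypothetical toy scores, not data from the paper.
SCORES = {
    "det_1": {
        "real": [0.1, 0.2, 0.3, 0.4],
        "fake": {"gen_a": [0.6, 0.7, 0.8, 0.9], "gen_b": [0.15, 0.25, 0.35, 0.9]},
    },
    "det_2": {
        "real": [0.1, 0.2, 0.3, 0.4],
        "fake": {"gen_a": [0.35, 0.7, 0.8, 0.9], "gen_b": [0.6, 0.7, 0.8, 0.9]},
    },
}

generators = sorted({g for d in SCORES.values() for g in d["fake"]})

# One-to-one protocol: one EER per (generator, detector) pair.
per_pair = {
    (g, det): equal_error_rate(d["real"], d["fake"][g])
    for det, d in SCORES.items()
    for g in generators
}

# Aggregated protocol: all generators' fakes pooled into one test set per detector.
pooled = {
    det: equal_error_rate(d["real"], [s for g in generators for s in d["fake"][g]])
    for det, d in SCORES.items()
}

# Assumed summary scores: plain averages over the one-to-one matrix.
evasiveness = {g: mean([per_pair[(g, det)] for det in SCORES]) for g in generators}
robustness = {
    det: 1 - mean([per_pair[(g, det)] for g in generators]) for det in SCORES
}

for key, value in sorted(per_pair.items()):
    print(key, round(value, 3))
print("pooled:", pooled)
print("evasiveness:", evasiveness)
print("robustness:", robustness)
```

In this toy setup the pooled EER for `det_1` sits between its two per-generator values, so the aggregated number hides the fact that `gen_b` defeats that detector far more often than `gen_a` does. The per-pair matrix keeps that information visible.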

🛡️ Threat Analysis

Output Integrity Attack

The paper directly evaluates detection systems for AI-generated audio (fake voice): generators produce synthetic audio outputs, and detectors attempt to verify content authenticity. The benchmark shows how modern neural codec and flow-matching generators evade deepfake detectors, which maps squarely to output integrity and AI-generated content detection.


Details

Domains
audio, generative
Model Types
gan, diffusion, transformer
Threat Tags
inference_time, black_box
Applications
fake voice detection, automatic speaker verification, anti-spoofing