benchmark arXiv Jan 7, 2025 · Jan 2025
Sung-Feng Huang, Heng-Cheng Kuo, Zhehuai Chen et al. · NVIDIA · National Taiwan University +1 more
Benchmark dataset (SINE) for seamless AI speech edit detection, revealing gaps in cut-and-paste-trained detectors
Output Integrity Attack audio
Neural speech editing advancements have raised concerns about their misuse in spoofing attacks. Traditional partially edited speech corpora primarily focus on cut-and-paste edits, which, while maintaining speaker consistency, often introduce detectable discontinuities. Recent methods, like A\textsuperscript{3}T and Voicebox, improve transitions by leveraging contextual information. To foster spoofing detection research, we introduce the Speech INfilling Edit (SINE) dataset, created with Voicebox. We detailed the process of re-implementing Voicebox training and dataset creation. Subjective evaluations confirm that speech edited using this novel technique is more challenging to detect than conventional cut-and-paste methods. Despite human difficulty, experimental results demonstrate that self-supervised-based detectors can achieve remarkable performance in detection, localization, and generalization across different edit methods. The dataset and related models will be made publicly available.
transformer diffusion NVIDIA · National Taiwan University · Academia Sinica