
Detecting the Undetectable: Assessing the Efficacy of Current Spoof Detection Methods Against Seamless Speech Edits

Sung-Feng Huang 1,2,3, Heng-Cheng Kuo 2,3, Zhehuai Chen 1, Xuesong Yang 1, Chao-Han Huck Yang 1, Yu Tsao 3, Yu-Chiang Frank Wang 1, Hung-yi Lee 2, Szu-Wei Fu 1


Published on arXiv: 2501.03805

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Self-supervised-based detectors achieve strong performance in detecting and localizing seamless Voicebox edits, even though human listeners find these edits significantly harder to spot than cut-and-paste edits.

SINE dataset

Novel technique introduced


Advances in neural speech editing have raised concerns about misuse in spoofing attacks. Traditional partially edited speech corpora focus primarily on cut-and-paste edits, which, while maintaining speaker consistency, often introduce detectable discontinuities. Recent methods, such as A³T and Voicebox, improve transitions by leveraging contextual information. To foster spoofing detection research, we introduce the Speech INfilling Edit (SINE) dataset, created with Voicebox. We detail the process of re-implementing Voicebox training and creating the dataset. Subjective evaluations confirm that speech edited with this novel technique is more challenging to detect than conventional cut-and-paste edits. Despite the difficulty for human listeners, experimental results show that self-supervised-based detectors achieve remarkable performance in detection, localization, and generalization across different edit methods. The dataset and related models will be made publicly available.


Key Contributions

  • Introduces the SINE (Speech INfilling Edit) dataset, the first corpus specifically designed for seamless speech editing detection using Voicebox neural infilling
  • Provides a detailed re-implementation of Voicebox training and a four-category audio generation pipeline (two edited types, two genuine types)
  • Evaluates four SOTA spoof detectors on SINE, showing self-supervised-based models generalize well despite human difficulty in detecting seamless edits
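Spoof detectors such as those evaluated here are typically scored with the equal error rate (EER), the operating point where the false-acceptance and false-rejection rates coincide. A minimal sketch of that metric (the function name and implementation are illustrative, not taken from the paper):

```python
def equal_error_rate(scores, labels):
    """Approximate EER from detector scores (higher = more spoof-like)
    and binary labels (1 = edited/spoofed, 0 = genuine).

    Sweeps each observed score as a decision threshold and returns the
    mean of FAR and FRR at the threshold where they are closest."""
    pairs = sorted(zip(scores, labels))
    n_spoof = sum(labels)
    n_genuine = len(labels) - n_spoof
    best_gap, eer = float("inf"), 1.0
    for threshold, _ in pairs:
        # classify a score >= threshold as "spoofed"
        far = sum(1 for s, y in pairs if y == 0 and s >= threshold) / n_genuine
        frr = sum(1 for s, y in pairs if y == 1 and s < threshold) / n_spoof
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With perfectly separated scores the EER is 0; overlap between genuine and edited scores pushes it toward 0.5. Production evaluations usually interpolate the ROC curve rather than sweeping raw thresholds, but the idea is the same.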

🛡️ Threat Analysis

Output Integrity Attack

The paper directly addresses detection of AI-generated/edited audio content (speech deepfakes created with Voicebox infilling). Creating the SINE dataset and evaluating existing detectors against seamless speech edits is a contribution to the AI-generated content detection research area — a canonical ML09 concern around output integrity and content authenticity.


Details

Domains
audio
Model Types
transformer, diffusion
Threat Tags
inference_time
Datasets
SINE (proposed), HAD dataset
Applications
audio deepfake detection, speech spoofing detection, partial speech edit detection