How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection

Yixuan Xiao 1,2, Florian Lux 1, Alejandro Pérez-González-de-Martos 2, Ngoc Thang Vu 1

0 citations · 25 references · arXiv (Cornell University)

Published on arXiv · 2602.16343

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

The choice of label for codec-resynthesized audio (bonafide vs. spoof) significantly affects deepfake detector performance; detectors risk learning codec compression artifacts rather than genuine synthesis cues.


Since text-to-speech systems typically do not produce waveforms directly, recent spoof detection studies use waveforms resynthesized by vocoders and neural audio codecs to simulate an attacker. Unlike vocoders, which are designed specifically for speech synthesis, neural audio codecs were originally developed to compress audio for storage and transmission. However, their ability to discretize speech has also sparked interest in language-modeling-based speech synthesis. Owing to this dual functionality, codec-resynthesized data may be labeled as either bonafide or spoof. So far, very little research has addressed this issue. In this study, we present a challenging extension of the ASVspoof 5 dataset constructed for this purpose. We examine how different labeling choices affect detection performance and provide insights into labeling strategies.


Key Contributions

  • Constructed a challenging dataset extension of ASVspoof 5 specifically designed to study the dual role of neural audio codecs (compression vs. TTS component) in audio deepfake detection
  • Analyzed how labeling codec-resynthesized speech as either bonafide (channel effect) or spoof (synthesis artifact) affects detection performance
  • Provided empirical insights and labeling strategy guidance for the anti-spoofing community when using neural audio codecs for data augmentation
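The labeling dilemma described above can be made concrete with a minimal sketch. The snippet below is purely illustrative: the utterance IDs, origin tags, protocol layout, and policy names are assumptions for demonstration, not the paper's actual ASVspoof 5 extension format. It shows how the same codec-resynthesized utterance receives opposite training labels depending on whether codec output is treated as a channel effect or a synthesis artifact.

```python
# Hypothetical sketch of the two labeling policies for codec-resynthesized
# audio. All names here (utterance IDs, origin tags, policy names) are
# illustrative assumptions, not the paper's actual data layout.

# (utterance_id, origin) pairs; origin records how the waveform was produced.
protocol = [
    ("utt_001", "original"),             # untouched bonafide recording
    ("utt_002", "tts"),                  # fully synthetic speech
    ("utt_003", "codec_resynthesized"),  # real speech passed through a neural codec
]

def label(origin: str, codec_policy: str) -> str:
    """Assign a training label under a given policy for codec output.

    codec_policy = "channel"   -> treat codec resynthesis as a channel
                                  effect, i.e. still bonafide.
    codec_policy = "synthesis" -> treat codec resynthesis as a synthesis
                                  artifact, i.e. spoof.
    """
    if origin == "original":
        return "bonafide"
    if origin == "tts":
        return "spoof"
    if origin == "codec_resynthesized":
        return "bonafide" if codec_policy == "channel" else "spoof"
    raise ValueError(f"unknown origin: {origin}")

labels_channel = {utt: label(origin, "channel") for utt, origin in protocol}
labels_synth = {utt: label(origin, "synthesis") for utt, origin in protocol}
# The two policies disagree only on the codec-resynthesized utterance,
# which is exactly the labeling choice the paper studies.
```

A detector trained under the "channel" policy sees codec artifacts in the bonafide class, while one trained under the "synthesis" policy sees them in the spoof class; the paper's dataset extension is built to measure how this single flipped label shifts detection performance.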

🛡️ Threat Analysis

Output Integrity Attack

The paper directly concerns AI-generated audio (deepfake) detection — specifically how neural audio codec resynthesis blurs the boundary between bonafide and spoofed audio, and how labeling strategies for training detectors affect output integrity verification.


Details

Domains
audio
Model Types
generative, transformer
Threat Tags
training_time, inference_time
Datasets
ASVspoof 5
Applications
audio deepfake detection, speaker verification anti-spoofing, speech synthesis detection