How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection

Yixuan Xiao 1,2, Florian Lux 1, Alejandro Pérez-González-de-Martos 2, Ngoc Thang Vu 1

0 citations · 25 references · arXiv (Cornell University)

Published on arXiv · 2602.16343

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

The choice of label for codec-resynthesized audio (bonafide vs. spoof) significantly affects deepfake detector performance; detectors risk learning codec compression artifacts rather than genuine synthesis cues.


Since text-to-speech systems typically do not produce waveforms directly, recent spoof detection studies use waveforms resynthesized by vocoders and neural audio codecs to simulate an attacker. Unlike vocoders, which are designed specifically for speech synthesis, neural audio codecs were originally developed to compress audio for storage and transmission. However, their ability to discretize speech has also sparked interest in language-modeling-based speech synthesis. Owing to this dual functionality, codec-resynthesized data may be labeled as either bonafide or spoof. So far, very little research has addressed this issue. In this study, we present a challenging extension of the ASVspoof 5 dataset constructed for this purpose. We examine how different labeling choices affect detection performance and provide insights into labeling strategies.


Key Contributions

  • Constructed a challenging dataset extension of ASVspoof 5 specifically designed to study the dual role of neural audio codecs (compression vs. TTS component) in audio deepfake detection
  • Analyzed how labeling codec-resynthesized speech as either bonafide (channel effect) or spoof (synthesis artifact) affects detection performance
  • Provided empirical insights and labeling strategy guidance for the anti-spoofing community when using neural audio codecs for data augmentation
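The labeling dilemma described above can be made concrete with a minimal sketch. The snippet below is purely illustrative: the utterance IDs, origin tags, protocol layout, and policy names are assumptions for demonstration, not the paper's actual ASVspoof 5 extension format. It shows how the same codec-resynthesized utterance receives opposite training labels depending on whether codec output is treated as a channel effect or a synthesis artifact.

```python
# Hypothetical sketch of the two labeling policies for codec-resynthesized
# audio. All names here (utterance IDs, origin tags, policy names) are
# illustrative assumptions, not the paper's actual data layout.

# (utterance_id, origin) pairs; origin records how the waveform was produced.
protocol = [
    ("utt_001", "original"),             # untouched bonafide recording
    ("utt_002", "tts"),                  # fully synthetic speech
    ("utt_003", "codec_resynthesized"),  # real speech passed through a neural codec
]

def label(origin: str, codec_policy: str) -> str:
    """Assign a training label under a given policy for codec output.

    codec_policy = "channel"   -> treat codec resynthesis as a channel
                                  effect, i.e. still bonafide.
    codec_policy = "synthesis" -> treat codec resynthesis as a synthesis
                                  artifact, i.e. spoof.
    """
    if origin == "original":
        return "bonafide"
    if origin == "tts":
        return "spoof"
    if origin == "codec_resynthesized":
        return "bonafide" if codec_policy == "channel" else "spoof"
    raise ValueError(f"unknown origin: {origin}")

labels_channel = {utt: label(origin, "channel") for utt, origin in protocol}
labels_synth = {utt: label(origin, "synthesis") for utt, origin in protocol}
# The two policies disagree only on the codec-resynthesized utterance,
# which is exactly the labeling choice the paper studies.
```

A detector trained under the "channel" policy sees codec artifacts in the bonafide class, while one trained under the "synthesis" policy sees them in the spoof class; the paper's dataset extension is built to measure how this single flipped label shifts detection performance.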

🛡️ Threat Analysis

Output Integrity Attack

The paper directly concerns AI-generated audio (deepfake) detection — specifically how neural audio codec resynthesis blurs the boundary between bonafide and spoofed audio, and how labeling strategies for training detectors affect output integrity verification.


Details

Domains
audio
Model Types
generative, transformer
Threat Tags
training_time, inference_time
Datasets
ASVspoof 5
Applications
audio deepfake detection, speaker verification anti-spoofing, speech synthesis detection