defense arXiv Sep 25, 2025 · Sep 2025
Duc-Tuan Truong, Tianchi Liu, Junjie Li et al. · Nanyang Technological University · National University of Singapore +1 more
Novel dual-path training with gradient alignment reduces gradient conflicts in data-augmented speech deepfake detectors, cutting EER by 18.69%
Output Integrity Attack audio
In speech deepfake detection (SDD), data augmentation (DA) is commonly used to improve model generalization across varied speech conditions and spoofing attacks. However, during training, the backpropagated gradients from original and augmented inputs may misalign, which can result in conflicting parameter updates. These conflicts could hinder convergence and push the model toward suboptimal solutions, thereby reducing the benefits of DA. To investigate and address this issue, we design a dual-path data-augmented (DPDA) training framework with gradient alignment for SDD. In our framework, each training utterance is processed through two input paths: one using the original speech and the other using its augmented version. This design allows us to compare and align their backpropagated gradient directions to reduce optimization conflicts. Our analysis shows that approximately 25% of training iterations exhibit gradient conflicts between the original inputs and their augmented counterparts when using RawBoost augmentation. By resolving these conflicts with gradient alignment, our method accelerates convergence, reducing the number of training epochs required, and achieves up to an 18.69% relative reduction in Equal Error Rate on the In-the-Wild dataset compared to the baseline.
transformer cnn Nanyang Technological University · National University of Singapore · The Hong Kong Polytechnic University
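The gradient-conflict idea in the DPDA abstract above can be illustrated with a small sketch. The abstract does not say which alignment rule the authors use, so the code below assumes a PCGrad-style projection: a conflict is declared when the two gradients have a negative dot product, and each gradient then has its opposing component removed before the two are summed. The function names (`project_out`, `align_gradients`) and the toy 2-D gradients are hypothetical and only serve the illustration.

```python
def dot(u, v):
    """Inner product of two equal-length gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(g, ref):
    """Remove from g the component that points against ref.

    Assumed PCGrad-style rule, not necessarily the paper's method.
    If g does not oppose ref, g is returned unchanged.
    """
    d = dot(g, ref)
    if d >= 0:
        return list(g)
    scale = d / dot(ref, ref)  # ref is non-zero whenever d < 0
    return [a - scale * b for a, b in zip(g, ref)]


def align_gradients(g_orig, g_aug):
    """Combine the gradients of the original and augmented paths.

    Returns (combined_gradient, conflict_flag).
    """
    conflict = dot(g_orig, g_aug) < 0
    if conflict:
        g_o = project_out(g_orig, g_aug)
        g_a = project_out(g_aug, g_orig)
    else:
        g_o, g_a = list(g_orig), list(g_aug)
    combined = [a + b for a, b in zip(g_o, g_a)]
    return combined, conflict


# Toy example: the two paths disagree along the first coordinate.
combined, conflict = align_gradients([1.0, 0.0], [-1.0, 1.0])
```

In this toy case the combined update has a non-negative dot product with both original gradients, which is the property such a projection is meant to provide. In an actual training loop the vectors would be the flattened parameter gradients from the two forward/backward passes.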
defense arXiv Sep 25, 2025 · Sep 2025
Duc-Tuan Truong, Tianchi Liu, Ruijie Tao et al. · Nanyang Technological University · National University of Singapore +1 more
Novel multi-centroid one-class learning detector for speech deepfakes using MOS-based quality-aware centroids and ensemble scoring
Output Integrity Attack audio
Recent work shows that one-class learning can detect unseen deepfake attacks by modeling a compact distribution of bona fide speech around a single centroid. However, the single-centroid assumption can oversimplify the bona fide speech representation and overlook useful cues, such as speech quality, which reflects the naturalness of the speech. Speech quality can be easily obtained using existing speech quality assessment models that estimate it as a Mean Opinion Score (MOS). In this paper, we propose QAMO: Quality-Aware Multi-Centroid One-Class Learning for speech deepfake detection. QAMO extends conventional one-class learning by introducing multiple quality-aware centroids. In QAMO, each centroid is optimized to represent a distinct speech quality subspace, enabling better modeling of intra-class variability in bona fide speech. In addition, QAMO supports a multi-centroid ensemble scoring strategy, which improves decision thresholding and reduces the need for quality labels during inference. With two centroids to represent high- and low-quality speech, our proposed QAMO achieves an equal error rate of 5.09% on the In-the-Wild dataset, outperforming previous one-class and quality-aware systems.
transformer Nanyang Technological University · National University of Singapore · The Hong Kong Polytechnic University
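The multi-centroid scoring described in the QAMO abstract above can be sketched as follows. The abstract does not give the exact ensemble rule, so this example assumes the score is the maximum cosine similarity between an utterance embedding and any centroid, which needs no quality label at inference time. The names `ensemble_score` and `is_bona_fide`, the threshold value, and the 2-D toy centroids are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def ensemble_score(embedding, centroids):
    """Score an embedding against every centroid and keep the best match.

    Taking the max is an assumed ensemble rule for illustration only.
    """
    return max(cosine(embedding, c) for c in centroids)


def is_bona_fide(embedding, centroids, threshold):
    """Accept the utterance if it is close enough to any centroid."""
    return ensemble_score(embedding, centroids) >= threshold


# Two toy centroids standing in for high- and low-quality bona fide speech.
centroids = [[1.0, 0.0], [0.0, 1.0]]
near_low_quality = [0.0, 2.0]    # aligned with the second centroid
far_from_both = [-1.0, -1.0]     # points away from both centroids
```

With a hypothetical threshold of 0.5, `is_bona_fide(near_low_quality, centroids, 0.5)` accepts the first embedding and rejects `far_from_both`. A single-centroid detector would have to place one centroid between the two quality modes, which is the oversimplification the abstract argues against.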