Deepfake Word Detection by Next-token Prediction using Fine-tuned Whisper
Hoan My Tran, Xin Wang, Wanying Ge, Xuechen Liu, Junichi Yamagishi
Published on arXiv (2602.22658)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Fine-tuned Whisper achieves low synthetic-word detection error rates on in-domain data and matches dedicated ResNet performance on out-of-domain data, while preserving transcription accuracy.
Whisper-based Deepfake Word Detector
Novel technique introduced
Deepfake speech utterances can be forged by replacing one or more words in a bona fide utterance with semantically different words synthesized by speech-generative models. Rather than building a dedicated synthetic-word detector from scratch, we propose a cost-effective method that fine-tunes a pre-trained Whisper model to detect synthetic words while transcribing the input utterance via next-token prediction. We further investigate using partially vocoded utterances as fine-tuning data, reducing the cost of data collection. Our experiments demonstrate that, on in-domain test data, the fine-tuned Whisper yields low synthetic-word detection error rates and low transcription error rates. On out-of-domain test data with synthetic words produced by unseen speech-generative models, the fine-tuned Whisper remains on par with a dedicated ResNet-based detection model; however, the overall performance degradation calls for strategies to improve its generalization capability.
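The detection mechanism can be sketched as target-sequence construction: synthetic words in the training transcript are wrapped in special boundary tokens, so the model learns to emit them during next-token prediction. The token names below are hypothetical; the paper only states that special boundary tokens are added to the vocabulary.

```python
# Sketch: build fine-tuning targets that mark synthetic words with
# boundary tokens. Token strings are illustrative placeholders, not
# the paper's actual tokens.
FAKE_START = "<|fake|>"
FAKE_END = "<|/fake|>"

def build_target(words, fake_mask):
    """Wrap each word flagged as synthetic (fake_mask[i] is True)
    in boundary tokens; bona fide words pass through unchanged."""
    out = []
    for word, is_fake in zip(words, fake_mask):
        out.append(f"{FAKE_START}{word}{FAKE_END}" if is_fake else word)
    return " ".join(out)

# Example: the second word has been replaced by a synthesized word.
target = build_target(
    ["the", "meeting", "is", "cancelled"],
    [False, True, False, False],
)
print(target)  # the <|fake|>meeting<|/fake|> is cancelled
```

At inference time, the fine-tuned model transcribes the audio and emits these boundary tokens around words it judges to be synthetic, so detection falls out of ordinary decoding with no separate classification head.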
Key Contributions
- Fine-tuning Whisper for joint ASR and word-level synthetic speech detection via next-token prediction with minimal architectural change (special boundary tokens)
- Investigation of partially vocoded utterances as cheaper training data to reduce data collection costs
- Empirical comparison against a dedicated ResNet-based detector on both in-domain and out-of-domain (unseen speech-generative models) test sets
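To make the comparison concrete, a simplified sketch of word-level detection scoring (assuming the hypothesis is already word-aligned with the reference; the paper's exact metric may differ) is:

```python
def word_detection_error_rate(ref_mask, hyp_mask):
    """Fraction of words whose synthetic/bona-fide label is wrong.
    Simplified sketch: assumes a one-to-one word alignment between
    reference and hypothesis, with 1 = synthetic, 0 = bona fide."""
    assert len(ref_mask) == len(hyp_mask), "masks must be word-aligned"
    errors = sum(r != h for r, h in zip(ref_mask, hyp_mask))
    return errors / len(ref_mask)

# One of four words is mislabeled -> error rate 0.25
print(word_detection_error_rate([0, 1, 0, 0], [0, 1, 1, 0]))  # 0.25
```

In practice the transcription itself may contain insertions and deletions, so a real evaluation would align reference and hypothesis words first (e.g. by edit distance) before comparing labels.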
🛡️ Threat Analysis
Proposes a novel deepfake audio detection method: detecting AI-synthesized words embedded within otherwise genuine utterances. This is AI-generated content detection (audio deepfake detection), a core ML09 concern around output integrity and content authenticity.