defense 2026

Deepfake Word Detection by Next-token Prediction using Fine-tuned Whisper

Hoan My Tran 1, Xin Wang 2, Wanying Ge 2, Xuechen Liu 2, Junichi Yamagishi 2

Published on arXiv

2602.22658

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

On in-domain data, fine-tuned Whisper achieves low synthetic-word detection and transcription error rates; on out-of-domain data with unseen speech-generative models it stays on par with a dedicated ResNet-based detector, although overall performance degrades.

Whisper-based Deepfake Word Detector

Novel technique introduced


Deepfake speech utterances can be forged by replacing one or more words in a bona fide utterance with semantically different words synthesized with speech-generative models. While a dedicated synthetic word detector could be developed, we developed a cost-effective method that fine-tunes a pre-trained Whisper model to detect synthetic words while transcribing the input utterance via next-token prediction. We further investigate using partially vocoded utterances as the fine-tuning data, thus reducing the cost of data collection. Our experiments demonstrate that, on in-domain test data, the fine-tuned Whisper yields low synthetic-word detection error rates and transcription error rates. On out-of-domain test data with synthetic words produced with unseen speech-generative models, the fine-tuned Whisper remains on par with a dedicated ResNet-based detection model; however, the overall performance degradation calls for strategies to improve its generalization capability.
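To make the idea of partially vocoded fine-tuning data more concrete, here is a minimal sketch of how such an utterance might be assembled. It assumes a time-aligned vocoded copy of the whole utterance and per-word sample spans are already available (for example from a vocoder and a forced aligner, neither of which is shown). The function name `splice_vocoded` and its arguments are hypothetical and are not taken from the paper; the authors' actual pipeline may differ, for instance by smoothing the splice boundaries.

```python
def splice_vocoded(original, vocoded, word_spans, vocode_flags):
    """Replace selected word regions of a waveform with vocoded samples.

    Hypothetical helper, not the paper's code.
    original, vocoded: equal-length sequences of audio samples (assumed aligned)
    word_spans: (start_sample, end_sample) per word, end exclusive
    vocode_flags: True where the word should come from the vocoded copy
    """
    if len(original) != len(vocoded):
        raise ValueError("original and vocoded must have the same length")
    if len(word_spans) != len(vocode_flags):
        raise ValueError("need exactly one flag per word span")

    mixed = list(original)  # copy so the bona fide waveform is left untouched
    for (start, end), use_vocoded in zip(word_spans, vocode_flags):
        if not 0 <= start <= end <= len(mixed):
            raise ValueError(f"span ({start}, {end}) is out of range")
        if use_vocoded:
            # Hard splice; a real pipeline would likely cross-fade here.
            mixed[start:end] = vocoded[start:end]
    return mixed


# Toy example: 10 samples, three words, only the middle word is vocoded.
bona_fide = [0.0] * 10
vocoded_copy = [1.0] * 10
spans = [(0, 3), (3, 7), (7, 10)]
flags = [False, True, False]
partially_vocoded = splice_vocoded(bona_fide, vocoded_copy, spans, flags)
```

The flags double as word-level labels for fine-tuning, which is what makes this kind of data cheap: no separate annotation of synthetic regions is needed.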


Key Contributions

  • Fine-tuning Whisper for joint ASR and word-level synthetic speech detection via next-token prediction with minimal architectural change (special boundary tokens)
  • Investigation of partially vocoded utterances as cheaper training data to reduce data collection costs
  • Empirical comparison against a dedicated ResNet-based detector on both in-domain and out-of-domain (unseen speech-generative models) test sets
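The first contribution above, detection via next-token prediction with special boundary tokens, can be illustrated with a small sketch of how a tagged target transcript might be built for fine-tuning and then parsed back into word-level labels at inference time. The token strings `<|fake_start|>` and `<|fake_end|>` and both function names are assumptions made for illustration; the paper's actual tokens and text format may differ.

```python
# Hypothetical boundary tokens; the paper's actual token strings may differ.
FAKE_START = "<|fake_start|>"
FAKE_END = "<|fake_end|>"


def build_target(words, is_synthetic):
    """Wrap each run of consecutive synthetic words in boundary tokens."""
    if len(words) != len(is_synthetic):
        raise ValueError("need exactly one label per word")
    pieces = []
    inside = False  # True while we are inside a synthetic run
    for word, fake in zip(words, is_synthetic):
        if fake and not inside:
            pieces.append(FAKE_START)
            inside = True
        elif not fake and inside:
            pieces.append(FAKE_END)
            inside = False
        pieces.append(word)
    if inside:  # close a run that reaches the end of the utterance
        pieces.append(FAKE_END)
    return " ".join(pieces)


def parse_prediction(text):
    """Split tagged text into plain words and per-word synthetic labels."""
    # Pad the tokens with spaces so they separate cleanly from adjacent words.
    spaced = text.replace(FAKE_START, f" {FAKE_START} ")
    spaced = spaced.replace(FAKE_END, f" {FAKE_END} ")
    words, labels = [], []
    inside = False
    for token in spaced.split():
        if token == FAKE_START:
            inside = True
        elif token == FAKE_END:
            inside = False
        else:
            words.append(token)
            labels.append(inside)
    return words, labels


words = ["the", "meeting", "is", "on", "friday"]
flags = [False, False, False, False, True]
target = build_target(words, flags)
decoded_words, decoded_flags = parse_prediction(target)
```

Because the labels live inside the text sequence, the same decoder output yields both the transcript (the words) and the detection result (the flags), which is why the approach needs only new tokens rather than a separate detection head.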

🛡️ Threat Analysis

Output Integrity Attack

The paper proposes a deepfake audio detection method that flags AI-synthesized words embedded within otherwise bona fide utterances. This falls under AI-generated content detection (audio deepfake detection), a core ML09 concern around output integrity and content authenticity.


Details

Domains
audio
Model Types
transformer
Threat Tags
inference_time
Datasets
partially vocoded speech dataset
Applications
deepfake audio detection, speech authentication, automatic speech recognition