Deepfake Word Detection by Next-token Prediction using Fine-tuned Whisper
Hoan My Tran, Xin Wang, Wanying Ge, Xuechen Liu, Junichi Yamagishi
Published on arXiv (2602.22658)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Fine-tuned Whisper achieves low synthetic-word detection error rates on in-domain data and matches dedicated ResNet performance on out-of-domain data, while preserving transcription accuracy.
Whisper-based Deepfake Word Detector
Novel technique introduced
Deepfake speech utterances can be forged by replacing one or more words in a bona fide utterance with semantically different words synthesized by speech-generative models. Rather than building a dedicated synthetic-word detector from scratch, we propose a cost-effective method that fine-tunes a pre-trained Whisper model to detect synthetic words while transcribing the input utterance via next-token prediction. We further investigate using partially vocoded utterances as fine-tuning data, reducing the cost of data collection. Our experiments demonstrate that, on in-domain test data, the fine-tuned Whisper yields low synthetic-word detection error rates and low transcription error rates. On out-of-domain test data with synthetic words produced by unseen speech-generative models, the fine-tuned Whisper remains on par with a dedicated ResNet-based detection model; however, the overall performance degradation calls for strategies to improve its generalization capability.
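The detection mechanism can be sketched as target-sequence construction: synthetic words in the training transcript are wrapped in special boundary tokens, so the model learns to emit them during next-token prediction. The token names below are hypothetical; the paper only states that special boundary tokens are added to the vocabulary.

```python
# Sketch: build fine-tuning targets that mark synthetic words with
# boundary tokens. Token strings are illustrative placeholders, not
# the paper's actual tokens.
FAKE_START = "<|fake|>"
FAKE_END = "<|/fake|>"

def build_target(words, fake_mask):
    """Wrap each word flagged as synthetic (fake_mask[i] is True)
    in boundary tokens; bona fide words pass through unchanged."""
    out = []
    for word, is_fake in zip(words, fake_mask):
        out.append(f"{FAKE_START}{word}{FAKE_END}" if is_fake else word)
    return " ".join(out)

# Example: the second word has been replaced by a synthesized word.
target = build_target(
    ["the", "meeting", "is", "cancelled"],
    [False, True, False, False],
)
print(target)  # the <|fake|>meeting<|/fake|> is cancelled
```

At inference time, the fine-tuned model transcribes the audio and emits these boundary tokens around words it judges to be synthetic, so detection falls out of ordinary decoding with no separate classification head.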
Key Contributions
- Fine-tuning Whisper for joint ASR and word-level synthetic speech detection via next-token prediction with minimal architectural change (special boundary tokens)
- Investigation of partially vocoded utterances as cheaper training data to reduce data collection costs
- Empirical comparison against a dedicated ResNet-based detector on both in-domain and out-of-domain (unseen speech-generative models) test sets
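To make the comparison concrete, a simplified sketch of word-level detection scoring (assuming the hypothesis is already word-aligned with the reference; the paper's exact metric may differ) is:

```python
def word_detection_error_rate(ref_mask, hyp_mask):
    """Fraction of words whose synthetic/bona-fide label is wrong.
    Simplified sketch: assumes a one-to-one word alignment between
    reference and hypothesis, with 1 = synthetic, 0 = bona fide."""
    assert len(ref_mask) == len(hyp_mask), "masks must be word-aligned"
    errors = sum(r != h for r, h in zip(ref_mask, hyp_mask))
    return errors / len(ref_mask)

# One of four words is mislabeled -> error rate 0.25
print(word_detection_error_rate([0, 1, 0, 0], [0, 1, 1, 0]))  # 0.25
```

In practice the transcription itself may contain insertions and deletions, so a real evaluation would align reference and hypothesis words first (e.g. by edit distance) before comparing labels.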
🛡️ Threat Analysis
Proposes a novel deepfake audio detection method: detecting AI-synthesized words embedded within otherwise genuine utterances. This is AI-generated content detection (audio deepfake detection), a core ML09 concern around output integrity and content authenticity.