
Referee: Reference-aware Audiovisual Deepfake Detection

Hyemin Boo , Eunsang Lee , Jiyoung Lee

Published on arXiv (2510.27475) · 0 citations · 45 references

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Referee achieves state-of-the-art deepfake detection performance on cross-dataset and cross-language evaluation protocols by leveraging cross-modal identity consistency from one-shot reference examples.

Referee

Novel technique introduced


Although deepfakes generated by advanced generative models pose rapidly growing threats, existing audiovisual deepfake detection approaches struggle to generalize to unseen forgeries. We propose a novel reference-aware audiovisual deepfake detection method, called Referee. It leverages speaker-specific cues from only one-shot examples to detect manipulations beyond spatiotemporal artifacts. By matching and aligning identity-related queries from reference and target content into cross-modal features, Referee jointly reasons about audiovisual synchrony and identity consistency. Extensive experiments on FakeAVCeleb, FaceForensics++, and KoDF demonstrate that Referee achieves state-of-the-art performance on cross-dataset and cross-language evaluation protocols. Experimental results highlight the importance of cross-modal identity verification for future deepfake detection. The code is available at https://github.com/ewha-mmai/referee.


Key Contributions

  • Reference-aware deepfake detection using one-shot speaker-specific identity cues to go beyond spatiotemporal artifact detection
  • Cross-modal alignment mechanism that jointly reasons about audiovisual synchrony and identity consistency via matched identity queries
  • State-of-the-art cross-dataset and cross-language generalization on FakeAVCeleb, FaceForensics++, and KoDF
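The core reference-aware idea can be illustrated with a toy sketch: compare face and voice embeddings of the target clip against one-shot reference embeddings of the claimed speaker, and flag identity drift. All function names, the embedding dimensions, and the simple weighted fusion below are illustrative assumptions for intuition only, not the paper's actual query-matching architecture.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors (epsilon guards zero norm).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def identity_consistency_score(ref_face, ref_voice, tgt_face, tgt_voice, w=0.5):
    """Toy reference-aware score: average the face and voice similarities
    between the target and the one-shot reference. A high score suggests
    the target preserves the reference identity (likely real)."""
    face_sim = cosine(ref_face, tgt_face)
    voice_sim = cosine(ref_voice, tgt_voice)
    return w * face_sim + (1 - w) * voice_sim

rng = np.random.default_rng(0)
ref_face = rng.standard_normal(128)
ref_voice = rng.standard_normal(128)

# A "real" target stays close to the reference identity (small perturbation);
# a "fake" target drifts toward an unrelated identity (independent embedding).
real_face = ref_face + 0.1 * rng.standard_normal(128)
real_voice = ref_voice + 0.1 * rng.standard_normal(128)
fake_face = rng.standard_normal(128)
fake_voice = rng.standard_normal(128)

real_score = identity_consistency_score(ref_face, ref_voice, real_face, real_voice)
fake_score = identity_consistency_score(ref_face, ref_voice, fake_face, fake_voice)
print(real_score > fake_score)  # the identity-consistent target should score higher
```

In the actual method, this comparison happens in a learned cross-modal feature space that also accounts for audiovisual synchrony, rather than on independent per-modality embeddings as in this sketch.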

🛡️ Threat Analysis

Output Integrity Attack

Proposes a novel method for detecting AI-generated audiovisual deepfakes — directly addresses output integrity and content authenticity by verifying whether audio/video has been synthetically manipulated.


Details

Domains
vision, audio, multimodal
Model Types
transformer, multimodal
Threat Tags
inference_time
Datasets
FakeAVCeleb, FaceForensics++, KoDF
Applications
audiovisual deepfake detection, face forgery detection, speaker identity verification