
Referee: Reference-aware Audiovisual Deepfake Detection

Hyemin Boo , Eunsang Lee , Jiyoung Lee

Published on arXiv (2510.27475) · 0 citations · 45 references

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Referee achieves state-of-the-art deepfake detection performance on cross-dataset and cross-language evaluation protocols by leveraging cross-modal identity consistency from one-shot reference examples.

Referee

Novel technique introduced


Although deepfakes generated by advanced generative models pose rapidly growing threats, existing audiovisual deepfake detection approaches struggle to generalize to unseen forgeries. We propose a novel reference-aware audiovisual deepfake detection method, called Referee. It leverages speaker-specific cues from only one-shot examples to detect manipulations beyond spatiotemporal artifacts. By matching and aligning identity-related queries from reference and target content into cross-modal features, Referee jointly reasons about audiovisual synchrony and identity consistency. Extensive experiments on FakeAVCeleb, FaceForensics++, and KoDF demonstrate that Referee achieves state-of-the-art performance on cross-dataset and cross-language evaluation protocols. Experimental results highlight the importance of cross-modal identity verification for future deepfake detection. The code is available at https://github.com/ewha-mmai/referee.


Key Contributions

  • Reference-aware deepfake detection using one-shot speaker-specific identity cues to go beyond spatiotemporal artifact detection
  • Cross-modal alignment mechanism that jointly reasons about audiovisual synchrony and identity consistency via matched identity queries
  • State-of-the-art cross-dataset and cross-language generalization on FakeAVCeleb, FaceForensics++, and KoDF
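The core reference-aware idea can be illustrated with a toy sketch: compare face and voice embeddings of the target clip against one-shot reference embeddings of the claimed speaker, and flag identity drift. All function names, the embedding dimensions, and the simple weighted fusion below are illustrative assumptions for intuition only, not the paper's actual query-matching architecture.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors (epsilon guards zero norm).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def identity_consistency_score(ref_face, ref_voice, tgt_face, tgt_voice, w=0.5):
    """Toy reference-aware score: average the face and voice similarities
    between the target and the one-shot reference. A high score suggests
    the target preserves the reference identity (likely real)."""
    face_sim = cosine(ref_face, tgt_face)
    voice_sim = cosine(ref_voice, tgt_voice)
    return w * face_sim + (1 - w) * voice_sim

rng = np.random.default_rng(0)
ref_face = rng.standard_normal(128)
ref_voice = rng.standard_normal(128)

# A "real" target stays close to the reference identity (small perturbation);
# a "fake" target drifts toward an unrelated identity (independent embedding).
real_face = ref_face + 0.1 * rng.standard_normal(128)
real_voice = ref_voice + 0.1 * rng.standard_normal(128)
fake_face = rng.standard_normal(128)
fake_voice = rng.standard_normal(128)

real_score = identity_consistency_score(ref_face, ref_voice, real_face, real_voice)
fake_score = identity_consistency_score(ref_face, ref_voice, fake_face, fake_voice)
print(real_score > fake_score)  # the identity-consistent target should score higher
```

In the actual method, this comparison happens in a learned cross-modal feature space that also accounts for audiovisual synchrony, rather than on independent per-modality embeddings as in this sketch.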

🛡️ Threat Analysis

Output Integrity Attack

Proposes a novel method for detecting AI-generated audiovisual deepfakes — directly addresses output integrity and content authenticity by verifying whether audio/video has been synthetically manipulated.


Details

Domains
vision, audio, multimodal
Model Types
transformer, multimodal
Threat Tags
inference_time
Datasets
FakeAVCeleb, FaceForensics++, KoDF
Applications
audiovisual deepfake detection, face forgery detection, speaker identity verification