Defense · 2026

ExposeAnyone: Personalized Audio-to-Expression Diffusion Models Are Robust Zero-Shot Face Forgery Detectors

Kaede Shiohara¹, Toshihiko Yamasaki¹, Vladislav Golyanik²

0 citations · 112 references · arXiv

Published on arXiv (arXiv:2601.02359)

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Outperforms previous state-of-the-art by 4.22 percentage points in average AUC across DF-TIMIT, DFDCP, KoDF, and IDForge benchmarks while also detecting Sora2-generated videos where prior methods fail.

ExposeAnyone

Novel technique introduced


Detecting unknown deepfake manipulations remains one of the most challenging problems in face forgery detection. Current state-of-the-art approaches fail to generalize to unseen manipulations, as they primarily rely on supervised training with existing deepfakes or pseudo-fakes, which leads to overfitting to specific forgery patterns. In contrast, self-supervised methods offer greater potential for generalization, but existing work struggles to learn discriminative representations from self-supervision alone. In this paper, we propose ExposeAnyone, a fully self-supervised approach based on a diffusion model that generates expression sequences from audio. The key idea is that, once the model is personalized to specific subjects using reference sets, it can compute identity distances between suspected videos and personalized subjects via diffusion reconstruction errors, enabling person-of-interest face forgery detection. Extensive experiments demonstrate that 1) our method outperforms the previous state-of-the-art method by 4.22 percentage points in average AUC on the DF-TIMIT, DFDCP, KoDF, and IDForge datasets, 2) our model is also capable of detecting Sora2-generated videos, where previous approaches perform poorly, and 3) our method is highly robust to corruptions such as blur and compression, highlighting its applicability to real-world face forgery detection.
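The paper's detection criterion rests on a simple mechanism: noise the suspected expression sequence, denoise it with the subject-personalized diffusion model, and treat the reconstruction error as an identity distance (genuine clips of the subject reconstruct well; off-identity forgeries do not). The sketch below illustrates that scoring loop only; the denoiser here is a hypothetical stand-in (a pull toward a subject's expression prior), not the paper's actual audio-to-expression model, and all names (`personalized_denoiser`, `identity_distance`, `subject_mean`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def personalized_denoiser(noisy, subject_mean):
    # Hypothetical stand-in for the subject-personalized diffusion model:
    # it pulls the noisy expression sequence toward the subject's
    # expression statistics learned from the reference set.
    return noisy + 0.5 * (subject_mean - noisy)

def identity_distance(expr_seq, subject_mean, noise_scale=0.3, n_trials=8):
    """Average diffusion reconstruction error: perturb the sequence with
    noise, denoise with the personalized model, and measure how far the
    reconstruction lands from the input."""
    errors = []
    for _ in range(n_trials):
        noisy = expr_seq + noise_scale * rng.standard_normal(expr_seq.shape)
        recon = personalized_denoiser(noisy, subject_mean)
        errors.append(np.mean((recon - expr_seq) ** 2))
    return float(np.mean(errors))

# Toy demo: a clip matching the subject's statistics vs. an off-identity clip.
subject_mean = np.zeros((16, 32))  # personalized expression prior (toy)
real_clip = subject_mean + 0.05 * rng.standard_normal((16, 32))
fake_clip = subject_mean + 1.0     # expressions far from the subject's prior

d_real = identity_distance(real_clip, subject_mean)
d_fake = identity_distance(fake_clip, subject_mean)
```

Thresholding `d_fake` against `d_real` yields the person-of-interest decision: a clip whose reconstruction error exceeds the range observed on the subject's reference videos is flagged as a forgery.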


Key Contributions

  • Fully self-supervised face forgery detection approach using personalized audio-to-expression diffusion models, eliminating dependence on labeled deepfake data
  • Identity distance metric computed via diffusion reconstruction errors between suspected videos and personalized subject models enabling person-of-interest detection
  • Demonstrated zero-shot generalization to unseen manipulations including Sora2-generated videos and robustness to real-world corruptions (blur, compression)

🛡️ Threat Analysis

Output Integrity Attack

Directly addresses AI-generated content detection (face forgery/deepfakes) with a novel detection architecture. Classified as an output integrity concern, since the method authenticates whether video content is genuine or AI-manipulated.


Details

Domains
vision, audio, multimodal, generative
Model Types
diffusion, multimodal
Threat Tags
inference_time, black_box
Datasets
DF-TIMIT, DFDCP, KoDF, IDForge
Applications
face forgery detection, deepfake detection, video authentication