
Investigating self-supervised representations for audio-visual deepfake detection

Dragos-Alexandru Boldisor 1,2, Stefan Smeu 1,2, Dan Oneata 2,1, Elisabeta Oneata 2

0 citations · 84 references · arXiv


Published on arXiv: 2511.17181

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Most self-supervised features encode complementary deepfake-relevant information and attend to manipulated regions, yet none generalize reliably across datasets — generalization failure is attributed to dataset characteristics rather than superficial feature learning.

Linear probing evaluation framework

Novel technique introduced


Self-supervised representations excel at many vision and speech tasks, but their potential for audio-visual deepfake detection remains underexplored. Unlike prior work that uses these features in isolation or buried within complex architectures, we systematically evaluate them across modalities (audio, video, multimodal) and domains (lip movements, generic visual content). We assess three key dimensions: detection effectiveness, interpretability of encoded information, and cross-modal complementarity. We find that most self-supervised features capture deepfake-relevant information, and that this information is complementary. Moreover, models primarily attend to semantically meaningful regions rather than spurious artifacts. Yet none generalize reliably across datasets. This generalization failure likely stems from dataset characteristics, not from the features themselves latching onto superficial patterns. These results expose both the promise and fundamental challenges of self-supervised representations for deepfake detection: while they learn meaningful patterns, achieving robust cross-domain performance remains elusive.


Key Contributions

  • Systematic linear-probing evaluation of diverse SSL representations (CLIP, Wav2Vec2, AV-HuBERT, DINO, MAE, HuBERT) for audio-visual deepfake detection across multiple datasets and generative models
  • Interpretability analysis via temporal and spatial localization showing models attend to semantically meaningful manipulated regions rather than spurious artifacts, with alignment to human annotations
  • Cross-modal complementarity study revealing that audio, visual, and multimodal SSL features encode distinct and complementary deepfake-relevant information despite consistent cross-dataset generalization failure
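The linear-probing setup described above can be sketched in a few lines: the SSL backbone stays frozen, its pooled embeddings become fixed feature vectors, and only a linear classifier is trained on top. The snippet below is a minimal illustration with synthetic stand-in features (the real pipeline would extract embeddings from models such as CLIP or AV-HuBERT); the dimensionality, class shift, and dataset sizes here are arbitrary assumptions, not values from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_train, n_test, dim = 400, 100, 768  # 768 is a typical ViT/wav2vec2 embedding size

# Stand-in for frozen SSL features, mean-pooled per clip:
# fake clips (label 1) get a small mean shift so the probe has signal to find.
X_train = rng.normal(size=(n_train, dim))
y_train = rng.integers(0, 2, size=n_train)  # 0 = real, 1 = fake
X_train[y_train == 1] += 0.2

X_test = rng.normal(size=(n_test, dim))
y_test = rng.integers(0, 2, size=n_test)
X_test[y_test == 1] += 0.2

# Linear probe: the backbone is never updated; only this classifier is fit.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

auc = roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])
print(f"probe AUC: {auc:.3f}")
```

Because the probe is linear, any detection performance it achieves must come from information already linearly encoded in the frozen features, which is what makes this a clean test of what each SSL representation captures.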

🛡️ Threat Analysis

Output Integrity Attack

The paper directly addresses AI-generated content detection — specifically audio-visual deepfakes — evaluating whether self-supervised features can reliably authenticate video content across audio, visual, and multimodal modalities.


Details

Domains
audio · vision · multimodal
Model Types
transformer · multimodal
Threat Tags
inference_time
Datasets
FakeAVCeleb · ExDDV · Chandra2025 in-the-wild deepfake dataset
Applications
audio-visual deepfake detection · video content authentication