
Investigating self-supervised representations for audio-visual deepfake detection

Dragos-Alexandru Boldisor 1,2, Stefan Smeu 1,2, Dan Oneata 2,1, Elisabeta Oneata 2

0 citations · 84 references · arXiv


Published on arXiv: 2511.17181

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Most self-supervised features encode complementary deepfake-relevant information and attend to manipulated regions, yet none generalize reliably across datasets — generalization failure is attributed to dataset characteristics rather than superficial feature learning.

Linear probing evaluation framework

Novel technique introduced


Self-supervised representations excel at many vision and speech tasks, but their potential for audio-visual deepfake detection remains underexplored. Unlike prior work that uses these features in isolation or buried within complex architectures, we systematically evaluate them across modalities (audio, video, multimodal) and domains (lip movements, generic visual content). We assess three key dimensions: detection effectiveness, interpretability of encoded information, and cross-modal complementarity. We find that most self-supervised features capture deepfake-relevant information, and that this information is complementary. Moreover, models primarily attend to semantically meaningful regions rather than spurious artifacts. Yet none generalize reliably across datasets. This generalization failure likely stems from dataset characteristics, not from the features themselves latching onto superficial patterns. These results expose both the promise and fundamental challenges of self-supervised representations for deepfake detection: while they learn meaningful patterns, achieving robust cross-domain performance remains elusive.


Key Contributions

  • Systematic linear-probing evaluation of diverse SSL representations (CLIP, Wav2Vec2, AV-HuBERT, DINO, MAE, HuBERT) for audio-visual deepfake detection across multiple datasets and generative models
  • Interpretability analysis via temporal and spatial localization showing models attend to semantically meaningful manipulated regions rather than spurious artifacts, with alignment to human annotations
  • Cross-modal complementarity study revealing that audio, visual, and multimodal SSL features encode distinct and complementary deepfake-relevant information despite consistent cross-dataset generalization failure
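The linear-probing setup described above can be sketched in a few lines: the SSL backbone stays frozen, its pooled embeddings become fixed feature vectors, and only a linear classifier is trained on top. The snippet below is a minimal illustration with synthetic stand-in features (the real pipeline would extract embeddings from models such as CLIP or AV-HuBERT); the dimensionality, class shift, and dataset sizes here are arbitrary assumptions, not values from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_train, n_test, dim = 400, 100, 768  # 768 is a typical ViT/wav2vec2 embedding size

# Stand-in for frozen SSL features, mean-pooled per clip:
# fake clips (label 1) get a small mean shift so the probe has signal to find.
X_train = rng.normal(size=(n_train, dim))
y_train = rng.integers(0, 2, size=n_train)  # 0 = real, 1 = fake
X_train[y_train == 1] += 0.2

X_test = rng.normal(size=(n_test, dim))
y_test = rng.integers(0, 2, size=n_test)
X_test[y_test == 1] += 0.2

# Linear probe: the backbone is never updated; only this classifier is fit.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

auc = roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])
print(f"probe AUC: {auc:.3f}")
```

Because the probe is linear, any detection performance it achieves must come from information already linearly encoded in the frozen features, which is what makes this a clean test of what each SSL representation captures.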

🛡️ Threat Analysis

Output Integrity Attack

The paper directly addresses AI-generated content detection — specifically audio-visual deepfakes — evaluating whether self-supervised features can reliably authenticate video content across audio, visual, and multimodal modalities.


Details

Domains
audio · vision · multimodal
Model Types
transformer · multimodal
Threat Tags
inference_time
Datasets
FakeAVCeleb · ExDDV · Chandra2025 in-the-wild deepfake dataset
Applications
audio-visual deepfake detection · video content authentication