defense 2026

SAVe: Self-Supervised Audio-visual Deepfake Detection Exploiting Visual Artifacts and Audio-visual Misalignment

Sahibzada Adil Shahzad 1, Ammarah Hashmi 2,3, Junichi Yamagishi 2,4, Yusuke Yasuda 1, Yu Tsao 2, Chia-Wen Lin 4, Yan-Tsung Peng 3, Hsin-Min Wang 2


Published on arXiv: 2603.25140

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves competitive in-domain performance and strong cross-dataset generalization using only real training videos

SAVe

Novel technique introduced


Multimodal deepfakes can exhibit subtle visual artifacts and cross-modal inconsistencies, which remain challenging to detect, especially when detectors are trained primarily on curated synthetic forgeries. Such synthetic dependence can introduce dataset and generator bias, limiting scalability and robustness to unseen manipulations. We propose SAVe, a self-supervised audio-visual deepfake detection framework that learns entirely on authentic videos. SAVe generates on-the-fly, identity-preserving, region-aware self-blended pseudo-manipulations to emulate tampering artifacts, enabling the model to learn complementary visual cues across multiple facial granularities. To capture cross-modal evidence, SAVe also models lip-speech synchronization via an audio-visual alignment component that detects temporal misalignment patterns characteristic of audio-visual forgeries. Experiments on FakeAVCeleb and AV-LipSync-TIMIT demonstrate competitive in-domain performance and strong cross-dataset generalization, highlighting self-supervised learning as a scalable paradigm for multimodal deepfake detection.
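The self-blending idea described above (blending a mildly transformed copy of a real frame back onto itself within a facial region, so the model learns the blending-boundary artifacts that face-swap pipelines leave behind) can be sketched in a few lines of numpy. This is an illustrative sketch, not the paper's implementation: the rectangular region, the roll-based feathering, and the color jitter are stand-ins for the identity-preserving, region-aware transforms the authors use.

```python
import numpy as np

def region_mask(h, w, top, bottom, left, right, feather=8):
    """Soft rectangular mask: 1 inside the region, 0 outside,
    with crude feathering via repeated neighbor averaging."""
    mask = np.zeros((h, w), dtype=np.float32)
    mask[top:bottom, left:right] = 1.0
    for _ in range(feather):  # each pass spreads the boundary by one pixel
        mask = 0.25 * (np.roll(mask, 1, 0) + np.roll(mask, -1, 0)
                       + np.roll(mask, 1, 1) + np.roll(mask, -1, 1))
    return mask[..., None]  # (h, w, 1), broadcasts over RGB

def self_blend(frame, region, color_jitter=0.1, seed=0):
    """Blend a color-jittered copy of `frame` (values in [0, 1]) back onto
    itself inside `region` = (top, bottom, left, right), producing an
    on-the-fly pseudo-manipulation with a soft blending boundary."""
    rng = np.random.default_rng(seed)
    jitter = 1.0 + rng.uniform(-color_jitter, color_jitter, size=(1, 1, 3))
    transformed = np.clip(frame * jitter, 0.0, 1.0)
    m = region_mask(frame.shape[0], frame.shape[1], *region)
    return m * transformed + (1.0 - m) * frame
```

Varying `region` (whole face, lips only, lower face) would yield pseudo-fakes at the multiple facial granularities the abstract refers to, all generated from authentic frames with no synthetic training data.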


Key Contributions

  • Self-supervised framework (SAVe) trained exclusively on authentic videos without labeled deepfake data
  • Unifies visual artifact detection (FaceBlend, LipBlend, LowerFaceBlend) with audio-visual synchronization detection (AVSync)
  • Demonstrates strong cross-dataset generalization on FakeAVCeleb and AV-LipSync-TIMIT benchmarks
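The AVSync component learns lip-speech synchronization from real videos alone; a common self-supervised recipe for this, sketched below under assumed conventions (the function name, feature shapes, and circular shift are illustrative, not the authors' code), is to treat genuine audio-visual pairs as in-sync positives and temporally shifted audio as out-of-sync negatives that mimic forgery misalignment.

```python
import numpy as np

def make_sync_pairs(video_feats, audio_feats, max_shift=5, seed=0):
    """Build pseudo-labeled training pairs from a single real video.
    video_feats, audio_feats: (T, D) frame-level features, assumed aligned.
    Returns (video, audio, label) tuples: label 1 = in-sync, 0 = shifted."""
    rng = np.random.default_rng(seed)
    shift = int(rng.integers(1, max_shift + 1))
    # Circular time shift of the audio stream creates the temporal
    # misalignment characteristic of audio-visual forgeries.
    misaligned_audio = np.roll(audio_feats, shift, axis=0)
    return [
        (video_feats, audio_feats, 1),
        (video_feats, misaligned_audio, 0),
    ]
```

A sync head trained on such pairs never sees a generated deepfake, yet learns a misalignment cue that transfers to manipulated content at inference time.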

🛡️ Threat Analysis

Output Integrity Attack

The primary contribution is detecting AI-generated audio-visual deepfakes (synthetic talking-face videos), i.e., AI-generated content detection for output integrity and authentication. The paper builds a detector that distinguishes real from fake multimodal content.


Details

Domains
multimodal, vision, audio
Model Types
multimodal, cnn, transformer
Threat Tags
inference_time
Datasets
FakeAVCeleb, AV-LipSync-TIMIT
Applications
deepfake detection, audio-visual forgery detection, talking-face authentication