defense 2026

MSCT: Differential Cross-Modal Attention for Deepfake Detection

Fangda Wei 1, Miao Liu 1, Yingxue Wang 2, Jing Wang 1, Shenghui Zhao 1, Nan Li 2

0 citations


Published on arXiv

2604.07741

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves competitive performance on the FakeAVCeleb dataset using a differential cross-modal attention that is more compatible with modal consistency (alignment) losses

MSCT

Novel technique introduced


Audio-visual deepfake detection typically employs a complementary multi-modal model to check a video for forgery traces. These methods primarily extract forgery traces through audio-visual alignment, exploiting the inconsistency between the audio and video modalities. However, traditional multi-modal forgery detection methods suffer from insufficient feature extraction and modal alignment deviation. To address this, we propose a multi-scale cross-modal transformer encoder (MSCT) for deepfake detection. Our approach includes a multi-scale self-attention module that integrates the features of adjacent embeddings and a differential cross-modal attention module that fuses multi-modal features. Our experiments demonstrate competitive performance on the FakeAVCeleb dataset, validating the effectiveness of the proposed structure.
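The phrase "integrate the features of adjacent embeddings" can be pictured with a small sketch. The abstract does not say how MSCT builds its scales or feeds them into self-attention, so the windowed averaging, the window sizes, and the names `window_average` and `multi_scale_views` below are all assumptions made for illustration, not the authors' design.

```python
def window_average(tokens, size):
    """Average each token with its neighbours inside a centred window.

    tokens: list of equal-length feature vectors (one per time step)
    size:   odd window width; the window is clipped at sequence edges
    """
    half = size // 2
    n, dim = len(tokens), len(tokens[0])
    pooled = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        window = tokens[lo:hi]
        pooled.append([sum(t[d] for t in window) / len(window) for d in range(dim)])
    return pooled


def multi_scale_views(tokens, sizes=(1, 3, 5)):
    """Hypothetical helper: one smoothed copy of the sequence per window size."""
    return {size: window_average(tokens, size) for size in sizes}
```

A window of size 1 returns the sequence unchanged, while larger windows mix in progressively more temporal context. A self-attention layer could then attend over any or all of these views.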


Key Contributions

  • Multi-scale self-attention module to extract temporal features across adjacent video frames
  • Differential cross-modal attention module that improves compatibility between cross-modal attention and alignment loss by focusing on audio-visual inconsistencies in fake videos
  • Achieves competitive performance on the FakeAVCeleb dataset for audio-visual deepfake detection
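As a rough illustration of the differential cross-modal attention idea listed above, the sketch below computes two softmax attention maps from video queries over audio keys and subtracts one from the other before weighting the audio values. This summary does not give the formula. The subtraction form, the weight `lam`, the choice of video as queries and audio as keys, and all function names are assumptions borrowed from the general differential-attention pattern, not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Row-wise softmax of scaled dot products between queries and keys."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]


def differential_cross_attention(q1, k1, q2, k2, values, lam=0.5):
    """Hypothetical differential cross-modal attention (an assumption).

    q1, q2: two projections of the video tokens (queries)
    k1, k2: two projections of the audio tokens (keys)
    values: audio token features
    lam:    weight of the subtracted attention map
    """
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    dim = len(values[0])
    out = []
    for row1, row2 in zip(a1, a2):
        # Difference of the two maps; weights may be negative.
        weights = [w1 - lam * w2 for w1, w2 in zip(row1, row2)]
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out
```

With `lam=0.0` this reduces to ordinary cross-attention. With identical projections and `lam=1.0` the two maps cancel and the output is zero, which shows how the subtraction suppresses attention patterns shared by both maps.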

🛡️ Threat Analysis

Output Integrity Attack

The core contribution is detecting deepfake videos (AI-generated synthetic content) by identifying audio-visual inconsistencies. This is AI-generated content detection, which falls under output integrity.


Details

Domains
multimodal, audio, vision
Model Types
transformer, multimodal
Threat Tags
inference_time
Datasets
FakeAVCeleb
Applications
deepfake detection, synthetic video detection