Multi-modal Deepfake Detection and Localization with FPN-Transformer
Chende Zheng, Ruiqi Suo, Zhoulin Ji, Jingyi Deng, Fangbin Yi, Chenhao Lin, Chao Shen
Published on arXiv (2511.08031)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Achieves a final score of 0.7535 on the IJCAI'25 DDL-AV benchmark for cross-modal deepfake detection and localization in challenging environments.
FPN-Transformer (R-TLM)
Novel technique introduced
The rapid advancement of generative adversarial networks (GANs) and diffusion models has enabled the creation of highly realistic deepfake content, posing significant threats to digital trust across audio-visual domains. While unimodal detection methods have shown progress in identifying synthetic media, their inability to leverage cross-modal correlations and precisely localize forged segments limits their practicality against sophisticated, fine-grained manipulations. To address this, we introduce a multi-modal deepfake detection and localization framework based on a Feature Pyramid Transformer (FPN-Transformer), which closes critical gaps in cross-modal generalization and temporal boundary regression. The proposed approach uses pre-trained self-supervised models (WavLM for audio, CLIP for video) to extract hierarchical temporal features. A multi-scale feature pyramid is constructed through R-TLM blocks with localized attention mechanisms, enabling joint analysis of cross-context temporal dependencies. A dual-branch prediction head simultaneously predicts forgery probabilities and refines the temporal offsets of manipulated segments, achieving frame-level localization precision. We evaluate our approach on the test set of the IJCAI'25 DDL-AV benchmark, achieving a final score of 0.7535 for cross-modal deepfake detection and localization in challenging environments. Experimental results confirm the effectiveness of our approach and point to a promising direction for generalized deepfake detection. Our code is available at https://github.com/Zig-HS/MM-DDL.
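The dual-branch head described above can be sketched in a few lines: a shared per-frame feature is projected into a classification branch (sigmoid forgery probability per frame) and a regression branch (offsets to the start and end of the manipulated segment). This is a minimal NumPy illustration, not the paper's implementation; all shapes, weights, and the segment-decoding step are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dual_branch_head(frame_feats, w_cls, w_reg):
    """Illustrative dual-branch prediction head (assumed design):
    - classification branch: per-frame forgery probability via sigmoid
    - regression branch: per-frame offsets to segment (start, end)"""
    logits = frame_feats @ w_cls              # (T, 1) classification logits
    probs = 1.0 / (1.0 + np.exp(-logits))     # sigmoid -> forgery probability
    offsets = frame_feats @ w_reg             # (T, 2) distances to boundaries
    return probs.squeeze(-1), offsets

T, D = 16, 8                                  # frames, feature dim (assumed)
feats = rng.standard_normal((T, D))
w_cls = rng.standard_normal((D, 1)) * 0.1
w_reg = rng.standard_normal((D, 2)) * 0.1

probs, offsets = dual_branch_head(feats, w_cls, w_reg)

# Decode one segment from the most confident frame: subtract/add the
# predicted start/end offsets from that frame index (hypothetical decoding).
t = int(np.argmax(probs))
start, end = t - offsets[t, 0], t + offsets[t, 1]
print(probs.shape, offsets.shape)
```

Training such a head typically combines a per-frame classification loss with a boundary-regression loss on forged frames, which is what lets the model refine segment boundaries below the granularity of the anchor frames.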
Key Contributions
- Multi-modal FPN-Transformer framework combining WavLM (audio) and CLIP (video) for cross-modal deepfake detection
- R-TLM blocks with localized attention for hierarchical temporal feature extraction across scales
- Dual-branch prediction head enabling simultaneous forgery probability prediction and temporal boundary regression for frame-level localization
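To make the second contribution concrete, here is a toy sketch of a multi-scale temporal pyramid built from localized (windowed) attention blocks: each frame attends only to neighbors within a fixed window, and stride-2 downsampling between blocks yields coarser temporal scales. This loosely mirrors stacked R-TLM blocks; the paper's actual block design (normalization, feed-forward layers, multi-head projections) is more involved, and everything here is an assumed simplification.

```python
import numpy as np

def local_attention(x, window=4):
    """Single-head windowed self-attention (simplified): each frame
    attends only to frames within +/- `window`, capturing local
    cross-context temporal dependencies at one scale."""
    T, D = x.shape
    out = np.zeros_like(x)
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        scores = x[lo:hi] @ x[t] / np.sqrt(D)   # scaled dot-product scores
        w = np.exp(scores - scores.max())       # stable softmax
        w /= w.sum()
        out[t] = w @ x[lo:hi]                   # weighted sum of neighbors
    return out

def feature_pyramid(x, levels=3):
    """Alternate local attention with stride-2 temporal downsampling
    to produce features at several temporal resolutions."""
    pyramid = []
    for _ in range(levels):
        x = local_attention(x)
        pyramid.append(x)
        x = x[::2]                              # halve temporal resolution
    return pyramid

rng = np.random.default_rng(1)
pyr = feature_pyramid(rng.standard_normal((16, 8)))
print([p.shape for p in pyr])                   # temporal lengths 16, 8, 4
```

The pyramid lets the detector match short, fine-grained manipulations at high temporal resolution while longer forged spans are captured at coarser levels, which is the usual motivation for FPN-style designs in temporal localization.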
🛡️ Threat Analysis
Proposes a novel deepfake detection and temporal localization architecture for AI-generated content (GAN- and diffusion-generated audio-visual media); it directly addresses output integrity by authenticating whether media content is synthetic. The architecture (FPN-Transformer with R-TLM blocks) is a novel detection contribution, not merely an application of existing methods.