ERF-BA-TFD+: A Multimodal Model for Audio-Visual Deepfake Detection

Xin Zhang, Jiaming Chu, Jian Zhao, Yuchu Jiang, Xu Yang, Lei Jin, Chi Zhang, Xuelong Li

Published on arXiv (2508.17282)

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves state-of-the-art accuracy and processing speed on the DDL-AV benchmark, winning Track 2 (Audio-Visual Detection and Localization) of the Workshop on Deepfake Detection, Localization, and Interpretability.

ERF-BA-TFD+

Novel technique introduced


Deepfake detection is a critical task in identifying manipulated multimedia content. In real-world scenarios, deepfake content can manifest across multiple modalities, including audio and video. To address this challenge, we present ERF-BA-TFD+, a novel multimodal deepfake detection model that combines enhanced receptive field (ERF) and audio-visual fusion. Our model processes both audio and video features simultaneously, leveraging their complementary information to improve detection accuracy and robustness. The key innovation of ERF-BA-TFD+ lies in its ability to model long-range dependencies within the audio-visual input, allowing it to better capture subtle discrepancies between real and fake content. In our experiments, we evaluate ERF-BA-TFD+ on the DDL-AV dataset, which consists of both segmented and full-length video clips. Unlike previous benchmarks, which focused primarily on isolated segments, the DDL-AV dataset allows us to assess the model's performance in a more comprehensive and realistic setting. Our method achieves state-of-the-art results on this dataset, outperforming existing techniques in terms of both accuracy and processing speed. The ERF-BA-TFD+ model demonstrated its effectiveness in the "Workshop on Deepfake Detection, Localization, and Interpretability," Track 2: Audio-Visual Detection and Localization (DDL-AV), and won first place in this competition.
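The abstract does not spell out how the enhanced receptive field (ERF) module is built. One common way such modules expand temporal context is by stacking dilated 1-D convolutions, where the receptive field grows geometrically with depth while parameters grow only linearly. A minimal sketch of that arithmetic (the function name and dilation schedule are illustrative assumptions, not the paper's actual design):

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked dilated 1-D convolutions:
    each layer adds (kernel_size - 1) * dilation timesteps of context."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Doubling the dilation at each layer enlarges the receptive field
# geometrically: four 3-tap layers already cover 31 timesteps.
print(receptive_field(3, [1, 2, 4, 8]))  # 31
```

This is why dilated stacks are a popular choice for capturing long-range audio-visual context without the quadratic cost of full attention.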


Key Contributions

  • Novel ERF-BA-TFD+ multimodal architecture combining enhanced receptive field (ERF) modules with audio-visual fusion for deepfake detection
  • Long-range dependency modeling across audio and video modalities to capture subtle audio-visual inconsistencies in deepfakes
  • State-of-the-art performance on the DDL-AV dataset (both segmented and full-length clips), winning first place in the DDL-AV workshop competition
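The cross-modal long-range dependency modeling listed above can be sketched as joint self-attention over the concatenated audio and video feature sequences, so every timestep can attend to evidence in either modality. This is a stand-in illustration under assumed shapes and random projections, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_audio_visual(audio, video, seed=0):
    """Joint self-attention over concatenated audio and video features
    (a sketch of long-range audio-visual dependency modeling).
    audio: (Ta, d), video: (Tv, d) -> fused: (Ta + Tv, d)."""
    rng = np.random.default_rng(seed)
    d = audio.shape[1]
    x = np.concatenate([audio, video], axis=0)          # (Ta+Tv, d)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)       # (Ta+Tv, Ta+Tv)
    return attn @ v                                      # each step mixes both modalities

audio = np.random.default_rng(1).standard_normal((8, 16))   # 8 audio frames
video = np.random.default_rng(2).standard_normal((12, 16))  # 12 video frames
fused = fuse_audio_visual(audio, video)
print(fused.shape)  # (20, 16)
```

Because the attention matrix spans both sequences, a mismatch between, say, lip motion and the audio track at distant timesteps can surface as an inconsistency in the fused representation.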

🛡️ Threat Analysis

Output Integrity Attack

Proposes a novel multimodal deepfake detection architecture specifically designed to identify manipulated audio-visual content — a direct contribution to AI-generated content detection and output integrity verification.


Details

Domains
vision, audio, multimodal
Model Types
multimodal
Threat Tags
inference_time, digital
Datasets
DDL-AV
Applications
deepfake detection, audio-visual forgery detection, manipulated video detection