Localizing Audio-Visual Deepfakes via Hierarchical Boundary Modeling
Xuanjun Chen 1, Shih-Peng Cheng 1, Jiawei Du 1, Lin Zhang 2, Xiaoxiao Miao 3, Chung-Che Wang 1, Haibin Wu, Hung-yi Lee 1, Jyh-Shing Roger Jang 1
Published on arXiv (2508.02000)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
HBMNet outperforms BA-TFD, BA-TFD+, and UMMAFormer on AV-Deepfake-1M for content-driven partial audio-visual deepfake localization and shows improved scalability with more training data.
HBMNet (Hierarchical Boundary Modeling Network)
Novel technique introduced
Audio-visual temporal deepfake localization under content-driven partial manipulation remains a highly challenging task. In this scenario, the deepfake regions usually span only a few frames, while the rest of the content remains identical to the original. To tackle this, we propose the Hierarchical Boundary Modeling Network (HBMNet), which comprises three modules: an Audio-Visual Feature Encoder that extracts discriminative frame-level representations, a Coarse Proposal Generator that predicts candidate boundary regions, and a Fine-grained Probabilities Generator that refines these proposals using bidirectional boundary-content probabilities. From the modality perspective, we enhance audio-visual learning through dedicated encoding and fusion, reinforced by frame-level supervision to boost discriminability. From the temporal perspective, HBMNet integrates multi-scale cues and bidirectional boundary-content relationships. Experiments show that encoding and fusion primarily improve precision, while frame-level supervision boosts recall. Each module (audio-visual fusion, temporal scales, bi-directionality) contributes complementary benefits, collectively enhancing localization performance. HBMNet outperforms BA-TFD and UMMAFormer and shows promising scalability with more training data.
Key Contributions
- HBMNet: a unified framework combining audio-visual feature encoding/fusion with hierarchical (coarse proposal + fine-grained frame-level) boundary modeling for temporal deepfake localization
- Bidirectional boundary-content probability modeling that captures both real-to-fake and fake-to-real transitions simultaneously
- Empirical finding that audio-visual fusion primarily improves precision while frame-level supervision boosts recall, with each module delivering complementary benefits
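The coarse-to-fine idea behind these contributions can be illustrated with a minimal sketch: starting from per-frame fake probabilities, group high-probability runs into coarse segment proposals, then snap each boundary to the frame with the strongest bidirectional boundary cue (a real-to-fake jump for starts, a fake-to-real drop for ends). This is a hypothetical toy illustration of the hierarchical boundary-modeling concept, not the authors' HBMNet implementation; all function names and the simple difference-based boundary scores are assumptions.

```python
# Toy sketch of coarse-to-fine temporal localization from frame-level
# fake probabilities. Illustrative only; not the HBMNet architecture.

def coarse_proposals(frame_probs, threshold=0.5):
    """Group consecutive frames whose fake probability exceeds the
    threshold into candidate (start, end) segments, end exclusive."""
    proposals, start = [], None
    for i, p in enumerate(frame_probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            proposals.append((start, i))
            start = None
    if start is not None:
        proposals.append((start, len(frame_probs)))
    return proposals

def boundary_scores(frame_probs):
    """Bidirectional boundary cues: a start score from the real-to-fake
    jump, an end score from the fake-to-real drop."""
    n = len(frame_probs)
    starts = [max(0.0, frame_probs[i] - (frame_probs[i - 1] if i else 0.0))
              for i in range(n)]
    ends = [max(0.0, frame_probs[i] - (frame_probs[i + 1] if i + 1 < n else 0.0))
            for i in range(n)]
    return starts, ends

def refine(proposals, starts, ends, radius=1):
    """Snap each coarse boundary to the highest-scoring nearby frame."""
    refined, n = [], len(starts)
    for s, e in proposals:
        s_cands = range(max(0, s - radius), min(n, s + radius + 1))
        e_cands = range(max(0, e - 1 - radius), min(n, e + radius))
        s = max(s_cands, key=lambda i: starts[i])
        e = max(e_cands, key=lambda i: ends[i]) + 1
        refined.append((s, e))
    return refined

# A short clip where frames 2-4 carry high fake probability.
probs = [0.1, 0.2, 0.9, 0.95, 0.8, 0.15, 0.1]
props = coarse_proposals(probs)          # coarse stage -> [(2, 5)]
st, en = boundary_scores(probs)
print(refine(props, st, en))             # refined segment boundaries
```

In the real model, the frame-level probabilities would come from the fused audio-visual encoder, and the refinement stage would use learned, multi-scale boundary-content probabilities rather than these simple first-order differences.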
🛡️ Threat Analysis
Proposes a novel forensic detection architecture for localizing AI-generated audio-visual manipulations, falling under output integrity and AI-generated content detection. The contribution is a new detection method (HBMNet), not a mere application of existing detectors.