Localizing Audio-Visual Deepfakes via Hierarchical Boundary Modeling
Xuanjun Chen 1, Shih-Peng Cheng 1, Jiawei Du 1, Lin Zhang 2, Xiaoxiao Miao 3, Chung-Che Wang 1, Haibin Wu, Hung-yi Lee 1, Jyh-Shing Roger Jang 1
Published on arXiv (2508.02000)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
HBMNet outperforms BA-TFD, BA-TFD+, and UMMAFormer on AV-Deepfake-1M for content-driven partial audio-visual deepfake localization and shows improved scalability with more training data.
HBMNet (Hierarchical Boundary Modeling Network)
Novel technique introduced
Audio-visual temporal deepfake localization under content-driven partial manipulation remains a highly challenging task. In this scenario, the deepfake regions usually span only a few frames, while the rest of the content remains identical to the original. To tackle this, we propose the Hierarchical Boundary Modeling Network (HBMNet), which comprises three modules: an Audio-Visual Feature Encoder that extracts discriminative frame-level representations, a Coarse Proposal Generator that predicts candidate boundary regions, and a Fine-grained Probabilities Generator that refines these proposals using bidirectional boundary-content probabilities. From the modality perspective, we enhance audio-visual learning through dedicated encoding and fusion, reinforced by frame-level supervision to boost discriminability. From the temporal perspective, HBMNet integrates multi-scale cues and bidirectional boundary-content relationships. Experiments show that encoding and fusion primarily improve precision, while frame-level supervision boosts recall. Each module (audio-visual fusion, temporal scales, bi-directionality) contributes complementary benefits, collectively enhancing localization performance. HBMNet outperforms BA-TFD and UMMAFormer and shows promising scalability with more training data.
Key Contributions
- HBMNet: a unified framework combining audio-visual feature encoding/fusion with hierarchical (coarse proposal + fine-grained frame-level) boundary modeling for temporal deepfake localization
- Bidirectional boundary-content probability modeling that captures both real-to-fake and fake-to-real transitions simultaneously
- Empirical finding that audio-visual fusion primarily improves precision while frame-level supervision boosts recall, with each module delivering complementary benefits
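The coarse-to-fine idea behind these contributions can be illustrated with a minimal sketch: starting from per-frame fake probabilities, group high-probability runs into coarse segment proposals, then snap each boundary to the frame with the strongest bidirectional boundary cue (a real-to-fake jump for starts, a fake-to-real drop for ends). This is a hypothetical toy illustration of the hierarchical boundary-modeling concept, not the authors' HBMNet implementation; all function names and the simple difference-based boundary scores are assumptions.

```python
# Toy sketch of coarse-to-fine temporal localization from frame-level
# fake probabilities. Illustrative only; not the HBMNet architecture.

def coarse_proposals(frame_probs, threshold=0.5):
    """Group consecutive frames whose fake probability exceeds the
    threshold into candidate (start, end) segments, end exclusive."""
    proposals, start = [], None
    for i, p in enumerate(frame_probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            proposals.append((start, i))
            start = None
    if start is not None:
        proposals.append((start, len(frame_probs)))
    return proposals

def boundary_scores(frame_probs):
    """Bidirectional boundary cues: a start score from the real-to-fake
    jump, an end score from the fake-to-real drop."""
    n = len(frame_probs)
    starts = [max(0.0, frame_probs[i] - (frame_probs[i - 1] if i else 0.0))
              for i in range(n)]
    ends = [max(0.0, frame_probs[i] - (frame_probs[i + 1] if i + 1 < n else 0.0))
            for i in range(n)]
    return starts, ends

def refine(proposals, starts, ends, radius=1):
    """Snap each coarse boundary to the highest-scoring nearby frame."""
    refined, n = [], len(starts)
    for s, e in proposals:
        s_cands = range(max(0, s - radius), min(n, s + radius + 1))
        e_cands = range(max(0, e - 1 - radius), min(n, e + radius))
        s = max(s_cands, key=lambda i: starts[i])
        e = max(e_cands, key=lambda i: ends[i]) + 1
        refined.append((s, e))
    return refined

# A short clip where frames 2-4 carry high fake probability.
probs = [0.1, 0.2, 0.9, 0.95, 0.8, 0.15, 0.1]
props = coarse_proposals(probs)          # coarse stage -> [(2, 5)]
st, en = boundary_scores(probs)
print(refine(props, st, en))             # refined segment boundaries
```

In the real model, the frame-level probabilities would come from the fused audio-visual encoder, and the refinement stage would use learned, multi-scale boundary-content probabilities rather than these simple first-order differences.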
🛡️ Threat Analysis
Proposes a novel forensic detection architecture for localizing AI-generated audio-visual manipulations, falling under output integrity and AI-generated content detection. The contribution is a new detection method (HBMNet), not a mere application of existing detectors.