defense 2025

Localizing Audio-Visual Deepfakes via Hierarchical Boundary Modeling

Xuanjun Chen 1, Shih-Peng Cheng 1, Jiawei Du 1, Lin Zhang 2, Xiaoxiao Miao 3, Chung-Che Wang 1, Haibin Wu, Hung-yi Lee 1, Jyh-Shing Roger Jang 1

0 citations


Published on arXiv (2508.02000)

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

HBMNet outperforms BA-TFD, BA-TFD+, and UMMAFormer on AV-Deepfake-1M for content-driven partial audio-visual deepfake localization and shows improved scalability with more training data.

HBMNet (Hierarchical Boundary Modeling Network)

Novel technique introduced


Audio-visual temporal deepfake localization under content-driven partial manipulation remains a highly challenging task. In this scenario, the manipulated regions usually span only a few frames, while the rest of the content remains identical to the original. To tackle this, we propose a Hierarchical Boundary Modeling Network (HBMNet), which includes three modules: an Audio-Visual Feature Encoder that extracts discriminative frame-level representations, a Coarse Proposal Generator that predicts candidate boundary regions, and a Fine-grained Probabilities Generator that refines these proposals using bidirectional boundary-content probabilities. From the modality perspective, we enhance audio-visual learning through dedicated encoding and fusion, reinforced by frame-level supervision to boost discriminability. From the temporal perspective, HBMNet integrates multi-scale cues and bidirectional boundary-content relationships. Experiments show that encoding and fusion primarily improve precision, while frame-level supervision boosts recall. Each module (audio-visual fusion, temporal scales, bi-directionality) contributes complementary benefits, collectively enhancing localization performance. HBMNet outperforms BA-TFD and UMMAFormer and shows promising scalability with more training data.
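The core localization idea, combining boundary evidence (real-to-fake onset, fake-to-real offset) with frame-level content evidence inside a candidate segment, can be sketched as a simple scoring rule. This is an illustrative toy, not the paper's actual formulation: the function `score_segments` and the probability arrays `p_onset`, `p_offset`, and `p_fake` are hypothetical names standing in for HBMNet's learned bidirectional boundary-content probabilities.

```python
import numpy as np

def score_segments(p_onset, p_offset, p_fake, max_len=8):
    """Score each candidate segment (s, e) by combining boundary and
    content evidence: onset probability at s, offset probability at e,
    and mean frame-level fakeness inside. Illustrative sketch only."""
    T = len(p_fake)
    scores = {}
    for s in range(T):
        for e in range(s, min(T, s + max_len)):
            boundary = p_onset[s] * p_offset[e]        # real->fake and fake->real transitions
            content = float(np.mean(p_fake[s:e + 1]))  # average fakeness of enclosed frames
            scores[(s, e)] = boundary * content
    return scores

# Toy example: a 10-frame clip with a (hypothetical) fake segment at frames 3..5.
p_onset = np.full(10, 0.05); p_onset[3] = 0.9
p_offset = np.full(10, 0.05); p_offset[5] = 0.9
p_fake = np.full(10, 0.10); p_fake[3:6] = 0.95

scores = score_segments(p_onset, p_offset, p_fake)
best = max(scores, key=scores.get)  # highest-scoring segment: (3, 5)
```

Scoring boundaries and content jointly, rather than thresholding frame-level fake probabilities alone, is what makes short manipulated spans separable from noisy per-frame predictions.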


Key Contributions

  • HBMNet: a unified framework combining audio-visual feature encoding/fusion with hierarchical (coarse proposal + fine-grained frame-level) boundary modeling for temporal deepfake localization
  • Bidirectional boundary-content probability modeling that captures both real-to-fake and fake-to-real transitions simultaneously
  • Empirical finding that audio-visual fusion primarily improves precision while frame-level supervision boosts recall, with each module delivering complementary benefits

🛡️ Threat Analysis

Output Integrity Attack

Proposes a novel deepfake detection architecture for localizing AI-generated audio-visual manipulations, addressing output integrity and AI-generated content detection. The contribution is a new forensic detection method (HBMNet), not merely an application of existing detectors.


Details

Domains
multimodal, audio, vision
Model Types
transformer, multimodal
Threat Tags
inference_time
Datasets
AV-Deepfake-1M, LAV-DF
Applications
audio-visual deepfake detection, temporal deepfake localization