defense 2025

Weakly Supervised Multimodal Temporal Forgery Localization via Multitask Learning

Wenbo Xu 1, Wei Lu 1, Xiangyang Luo 2

0 citations


Published on arXiv: 2508.02179

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

WMMT achieves temporal forgery localization performance comparable to fully supervised approaches while using only video-level annotations in a multimodal setting

WMMT

Novel technique introduced


The spread of Deepfake videos has caused a trust crisis and impaired social stability. Although numerous approaches have been proposed for Deepfake detection and localization, there is still a lack of systematic research on weakly supervised multimodal fine-grained temporal forgery localization (WS-MTFL). In this paper, we propose WMMT, a novel weakly supervised multimodal temporal forgery localization framework that addresses WS-MTFL under the multitask learning paradigm. WMMT achieves multimodal fine-grained Deepfake detection and temporal partial forgery localization using only video-level annotations. Specifically, detection in the visual and audio modalities is formulated as two binary classification tasks, and the multitask learning paradigm integrates them into a single multimodal task. Furthermore, WMMT employs a Mixture-of-Experts structure to adaptively select appropriate features and localization heads, achieving strong flexibility and localization precision in WS-MTFL. A feature enhancement module with a temporal property preserving attention mechanism is proposed to identify intra- and inter-modality feature deviations and construct comprehensive video features. To further exploit temporal information under weak supervision, we propose an extensible deviation perceiving loss that enlarges the deviation between adjacent segments of forged samples and reduces it for genuine samples. Extensive experiments demonstrate the effectiveness of multitask learning for WS-MTFL, and WMMT achieves results comparable to fully supervised approaches on several evaluation metrics.
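The deviation perceiving loss is described only at a high level in the abstract. The snippet below is a minimal numpy sketch of one plausible form, assuming deviation is measured as the L2 distance between adjacent segment embeddings and that the "enlarge" objective is realized as a margin hinge; the function name, the `margin` parameter, and the hinge formulation are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def deviation_perceiving_loss(features: np.ndarray, is_forged: bool,
                              margin: float = 1.0) -> float:
    """Sketch of a deviation perceiving loss (assumed form, not the paper's).

    features: (T, D) array of per-segment embeddings for one video.
    Forged videos: push the deviation between adjacent segments above
    `margin` (hinge penalty when it falls below). Genuine videos:
    penalize any adjacent-segment deviation directly.
    """
    # Deviation of each segment from its successor: T-1 distances.
    deviation = np.linalg.norm(features[1:] - features[:-1], axis=1)
    if is_forged:
        # Hinge: loss is nonzero only where deviation < margin.
        return float(np.mean(np.maximum(0.0, margin - deviation)))
    # Genuine: pull adjacent segments together.
    return float(np.mean(deviation))
```

Because the loss needs only a video-level forged/genuine label, it is compatible with the weakly supervised setting: no segment-level annotation enters the computation.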


Key Contributions

  • Formalizes WS-MTFL (weakly supervised multimodal temporal forgery localization) as a new problem combining multimodal deepfake detection with temporal segment localization using only video-level annotations
  • WMMT framework using Mixture-of-Experts with a modality-aware expert selection mechanism and temporal property preserving attention (TPPA) to capture intra- and inter-modality forgery traces
  • Extensible deviation perceiving loss that maximizes temporal deviation in adjacent segments of forged samples and minimizes it for genuine samples, improving weakly supervised localization
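The Mixture-of-Experts selection in the second bullet can be illustrated with a generic soft-gating sketch. The gate/expert shapes and the softmax routing below are standard MoE mechanics assumed for illustration; the paper's modality-aware selection may differ in how the gate is conditioned.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x: np.ndarray, gate_w: np.ndarray,
                expert_ws: list[np.ndarray]) -> np.ndarray:
    """Soft mixture-of-experts forward pass (generic sketch).

    x: (N, D_in) input features; gate_w: (D_in, E) gating weights;
    expert_ws: list of E expert matrices, each (D_in, D_out).
    A gate scores each expert per input, and the output is the
    gate-weighted sum of the expert outputs.
    """
    gates = softmax(x @ gate_w)                                # (N, E)
    expert_out = np.stack([x @ w for w in expert_ws], axis=1)  # (N, E, D_out)
    return (gates[..., None] * expert_out).sum(axis=1)         # (N, D_out)
```

In a modality-aware variant, the gate input would also encode which modality (visual, audio, or fused) produced `x`, so that experts specialize per modality.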

🛡️ Threat Analysis

Output Integrity Attack

Directly addresses deepfake detection and temporal forgery localization: detecting AI-manipulated content in both the visual and audio modalities is a canonical ML09 output integrity/content authenticity task. The paper proposes a novel detection architecture (WMMT with Mixture-of-Experts, TPPA attention, and a deviation perceiving loss) rather than merely applying existing methods.


Details

Domains
multimodal, vision, audio
Model Types
transformer, multimodal
Threat Tags
inference_time
Applications
deepfake video detection, temporal forgery localization, multimodal forgery detection