Defense · 2026

Mining Forgery Traces from Reconstruction Error: A Weakly Supervised Framework for Multimodal Deepfake Temporal Localization

Midou Guo 1,2, Qilin Yin 2, Wei Lu 1, Xiangyang Luo 3, Rui Yang 2


Published on arXiv · 2601.21458

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

RT-DeepLoc achieves state-of-the-art weakly supervised temporal forgery localization on LAV-DF by leveraging MAE reconstruction errors as forgery indicators without frame-level labels.

RT-DeepLoc

Novel technique introduced


Modern deepfakes have evolved into localized and intermittent manipulations that require fine-grained temporal localization. The prohibitive cost of frame-level annotation makes weakly supervised methods, which rely only on video-level labels, a practical necessity. To this end, we propose Reconstruction-based Temporal Deepfake Localization (RT-DeepLoc), a weakly supervised temporal forgery localization framework that identifies forgeries via reconstruction errors. Our framework uses a Masked Autoencoder (MAE) trained exclusively on authentic data to learn the intrinsic spatiotemporal patterns of real videos; the model therefore produces significant reconstruction discrepancies on forged segments, supplying the fine-grained cues that video-level labels lack. To robustly leverage these indicators, we introduce a novel Asymmetric Intra-video Contrastive Loss (AICL). By focusing on the compactness of authentic features guided by these reconstruction cues, AICL establishes a stable decision boundary that enhances local discrimination while preserving generalization to unseen forgeries. Extensive experiments on large-scale datasets, including LAV-DF, demonstrate that RT-DeepLoc achieves state-of-the-art performance in weakly supervised temporal forgery localization.
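The core cue is simple: an MAE trained only on authentic videos reconstructs real frames well and forged frames poorly, so per-frame reconstruction error can serve as a pseudo frame-level signal. A minimal numpy sketch of that scoring step (the function name, min-max normalization, and squared-error choice are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def forgery_scores(features, reconstructions):
    """Per-frame forgery indicator from MAE reconstruction error.

    features:        (T, D) frame features of a video
    reconstructions: (T, D) output of an MAE trained on authentic
                     data only (assumed given here)

    Returns a (T,) array of min-max normalized squared errors:
    higher values mark frames the authentic-only MAE fails to
    reconstruct, i.e. likely forged segments.
    """
    err = np.sum((features - reconstructions) ** 2, axis=1)
    rng = err.max() - err.min()
    if rng == 0:
        return np.zeros_like(err)
    return (err - err.min()) / rng
```

In this reading, the scores would then be thresholded or fed into the contrastive objective to decide which frames count as pseudo-authentic.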


Key Contributions

  • RT-DeepLoc: a weakly supervised temporal forgery localization framework using a Masked Autoencoder trained exclusively on authentic data to expose forged segments via reconstruction errors
  • Asymmetric Intra-video Contrastive Loss (AICL) that focuses on compactness of authentic features guided by reconstruction cues rather than clustering heterogeneous forgery patterns
  • State-of-the-art weakly supervised temporal deepfake localization on LAV-DF without requiring frame-level annotations
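The "asymmetric" design in the second contribution can be read as: pull presumed-authentic features into one compact cluster, but only push forged features away past a margin, never clustering them, since forgery patterns are heterogeneous. A hedged numpy sketch of one plausible form of such a loss (the threshold `tau`, the margin value, and the exact pull/push terms are hypothetical, not taken from the paper):

```python
import numpy as np

def aicl(features, scores, tau=0.5, margin=2.0):
    """Sketch of an Asymmetric Intra-video Contrastive Loss.

    features: (T, D) frame features of one video
    scores:   (T,) reconstruction-error scores in [0, 1]; frames
              with low scores are treated as pseudo-authentic
    """
    auth = features[scores < tau]
    fake = features[scores >= tau]
    center = auth.mean(axis=0)
    # Pull term: authentic features are made compact around their
    # own center (this is where the decision boundary stabilizes).
    pull = float(np.mean(np.sum((auth - center) ** 2, axis=1)))
    if len(fake) == 0:
        return pull
    # Asymmetric push term: forged features are only repelled
    # beyond a margin; no compactness is imposed on them.
    dist = np.sqrt(np.sum((fake - center) ** 2, axis=1))
    push = float(np.mean(np.maximum(0.0, margin - dist) ** 2))
    return pull + push
```

The asymmetry is the design choice: because unseen forgeries need not resemble training forgeries, the loss avoids modeling the forged class and instead anchors only the authentic one.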

🛡️ Threat Analysis

Output Integrity Attack

Proposes a novel detection architecture for identifying and temporally localizing AI-generated/forged content in multimodal videos — directly addresses deepfake content authentication and output integrity.


Details

Domains
vision · audio · multimodal
Model Types
transformer · multimodal
Threat Tags
inference_time · digital
Datasets
LAV-DF
Applications
deepfake detection · multimodal video forgery localization · temporal forgery localization