OmniFD: A Unified Model for Versatile Face Forgery Detection
Haotian Liu, Haoyu Chen, Chenhui Pan, You Hu, Guoying Zhao, Xiaobai Li
Published on arXiv (2512.01128)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
OmniFD reduces model parameters by 63% and training time by 50% versus task-specific baselines, and video classification accuracy improves by 4.63% when image data are incorporated via multi-task learning
OmniFD
Novel technique introduced
Face forgery detection encompasses multiple critical tasks, including identifying forged images and videos and localizing manipulated regions and temporal segments. Current approaches typically employ task-specific models with independent architectures, leading to computational redundancy and ignoring potential correlations across related tasks. We introduce OmniFD, a unified framework that jointly addresses four core face forgery detection tasks within a single model: image classification, video classification, spatial localization, and temporal localization. Our architecture consists of three principal components: (1) a shared Swin Transformer encoder that extracts unified 4D spatiotemporal representations from both image and video inputs, (2) a cross-task interaction module with learnable queries that dynamically captures inter-task dependencies through attention-based reasoning, and (3) lightweight decoding heads that transform the refined representations into predictions for all FFD tasks. Extensive experiments demonstrate OmniFD's advantage over task-specific models. Its unified design leverages multi-task learning to capture generalized representations across tasks, enabling fine-grained knowledge transfer that benefits each task. For example, video classification accuracy improves by 4.63% when image data are incorporated. Furthermore, by unifying images, videos, and the four tasks within one framework, OmniFD achieves superior performance across diverse benchmarks with high efficiency and scalability, e.g., reducing model parameters by 63% and training time by 50%. It establishes a practical and generalizable solution for comprehensive face forgery detection in real-world applications. The source code is made available at https://github.com/haotianll/OmniFD.
Key Contributions
- Unified OmniFD framework addressing four face forgery detection tasks (image classification, video classification, spatial localization, temporal localization) within a single model using a shared Swin Transformer encoder
- Cross-task interaction module with learnable queries that dynamically captures inter-task dependencies via attention-based reasoning, enabling knowledge transfer across tasks
- 63% reduction in model parameters and 50% reduction in training time compared to task-specific models, while achieving superior or competitive performance across diverse FFD benchmarks
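The cross-task interaction described above can be illustrated with a minimal sketch: learnable per-task queries attend over the shared encoder's spatiotemporal tokens, so each task head reads its own summary of the same features. This is an assumption-based illustration of query-based cross-attention, not the authors' implementation; the query count, token count, and feature dimension below are hypothetical, and real models would add learned projections and multiple heads.

```python
# Sketch of learnable-query cross-attention over shared features
# (illustrative only; not the OmniFD source code).
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_task_attention(task_queries, features):
    """task_queries: (num_tasks, d) learnable vectors, one per task.
    features: (num_tokens, d) flattened spatiotemporal tokens from the
    shared encoder. Returns one refined d-dim vector per task
    (single-head, no learned projections, for clarity)."""
    d = features.shape[-1]
    scores = task_queries @ features.T / np.sqrt(d)  # (num_tasks, num_tokens)
    attn = softmax(scores, axis=-1)                  # attention weights
    return attn @ features                           # (num_tasks, d)

rng = np.random.default_rng(0)
d = 32
# Four hypothetical queries: image cls, video cls, spatial loc, temporal loc.
queries = rng.normal(size=(4, d))
tokens = rng.normal(size=(196, d))  # assumed flattened token grid
refined = cross_task_attention(queries, tokens)
print(refined.shape)  # (4, 32)
```

Each row of `refined` would then feed the corresponding lightweight decoding head; because all four queries attend over the same shared features, gradients from one task shape representations used by the others, which is the mechanism behind the reported cross-task knowledge transfer.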
🛡️ Threat Analysis
Deepfake and face forgery detection is explicitly an output integrity / AI-generated content detection task. OmniFD proposes a novel unified architecture for detecting manipulated faces across four tasks (image classification, video classification, spatial localization, temporal localization), making it a novel forensic detection contribution rather than a straightforward application of existing methods to a new domain.