TwoHead-SwinFPN: A Unified DL Architecture for Synthetic Manipulation, Detection and Localization in Identity Documents

The proliferation of sophisticated generative AI models has significantly escalated the threat of synthetic manipulations in identity documents, particularly through face-swapping and text-inpainting attacks. This paper presents TwoHead-SwinFPN, a unified deep learning architecture that simultaneously performs binary classification and precise localization of manipulated regions in ID documents. Our approach integrates a Swin Transformer backbone with a Feature Pyramid Network (FPN) and a UNet-style decoder, enhanced with a Convolutional Block Attention Module (CBAM) for improved feature representation. The model employs a dual-head architecture for joint optimization of the detection and segmentation tasks, using uncertainty-weighted multi-task learning. Extensive experiments on the FantasyIDiap dataset yield 84.31% accuracy, 90.78% AUC, and an 88.61% F1-score for binary classification, and a 57.24% mean Dice score for localization. The method remains computationally efficient enough for real-world deployment, which is provided through a FastAPI implementation. Our comprehensive evaluation includes ablation studies, cross-device generalization analysis, and detailed performance assessment across 10 languages and 3 acquisition devices.
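To make the localization metric concrete, the following is a minimal sketch of how a mean Dice score can be computed over binary masks. It is illustrative only: the function names `dice_score` and `mean_dice`, the flat-list mask representation, and the convention of scoring two empty masks as 1.0 are assumptions, not details taken from the paper.

```python
def dice_score(pred, target):
    """Dice = 2 * |P intersect T| / (|P| + |T|) for two flat binary masks (lists of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


def mean_dice(pairs):
    """Average the per-image Dice score over (prediction, ground truth) pairs."""
    scores = [dice_score(pred, target) for pred, target in pairs]
    return sum(scores) / len(scores)
```

Under this definition, a prediction that marks two pixels when only one of them is truly manipulated scores 2 * 1 / (2 + 1), or about 0.67, which illustrates why pixel-level Dice tends to be lower than image-level classification accuracy.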

Key Contributions

Unified dual-head architecture combining Swin Transformer backbone with Feature Pyramid Network for simultaneous binary classification and pixel-level localization of synthetic manipulations in ID documents
CBAM-enhanced decoder with uncertainty-weighted multi-task learning for joint optimization of detection and segmentation objectives
Cross-device and cross-language generalization evaluation on the FantasyIDiap dataset with FastAPI deployment for real-world inference
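
The uncertainty-weighted multi-task learning named in the contributions can be sketched as follows. This uses one common simplified form of homoscedastic uncertainty weighting, in which each task loss is scaled by exp(-s) and regularized by s, with s a learnable log-variance. Whether the paper uses exactly this form is an assumption, and the function and parameter names are hypothetical.

```python
import math


def uncertainty_weighted_loss(cls_loss, seg_loss, log_var_cls, log_var_seg):
    """Combine a classification loss and a segmentation loss.

    Each s = log(sigma^2) is a per-task log-variance (assumed learnable in training):
        total = exp(-s_cls) * L_cls + s_cls + exp(-s_seg) * L_seg + s_seg
    A larger s down-weights its task, while the additive s term
    penalizes growing the variance without bound.
    """
    weighted_cls = math.exp(-log_var_cls) * cls_loss + log_var_cls
    weighted_seg = math.exp(-log_var_seg) * seg_loss + log_var_seg
    return weighted_cls + weighted_seg
```

With both log-variances at zero the result reduces to the plain sum of the two losses. In a real training loop the log-variances would be optimized jointly with the network weights, so the balance between the detection and segmentation heads is learned rather than hand-tuned.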

🛡️ Threat Analysis

Output Integrity Attack

The paper's primary contribution is detecting AI-generated synthetic content (GAN/diffusion-based face swaps and text inpainting) in identity documents. Because it proposes a novel detection architecture rather than a domain fine-tune of an existing detector, it qualifies as an AI-generated content detection method under output integrity.

Details

Domains

vision

Model Types

transformer, cnn

Threat Tags

inference_time, digital

Datasets

FantasyIDiap

Applications

2026 1 cit.
