TwoHead-SwinFPN: A Unified DL Architecture for Synthetic Manipulation, Detection and Localization in Identity Documents
Chan Naseeb 1, Adeel Ashraf Cheema 2, Hassan Sami 2, Tayyab Afzal 3, Muhammad Omair 2, Usman Habib 2
Published on arXiv
2601.12895
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Achieves 84.31% accuracy, 90.78% AUC for classification, and 57.24% mean Dice score for localization of synthetic manipulations across 10 languages and 3 acquisition devices.
TwoHead-SwinFPN
Novel technique introduced
The proliferation of sophisticated generative AI models has significantly escalated the threat of synthetic manipulations in identity documents, particularly through face swapping and text inpainting attacks. This paper presents TwoHead-SwinFPN, a unified deep learning architecture that simultaneously performs binary classification and precise localization of manipulated regions in ID documents. Our approach integrates a Swin Transformer backbone with Feature Pyramid Network (FPN) and UNet-style decoder, enhanced with Convolutional Block Attention Module (CBAM) for improved feature representation. The model employs a dual-head architecture for joint optimization of detection and segmentation tasks, utilizing uncertainty-weighted multi-task learning. Extensive experiments on the FantasyIDiap dataset demonstrate superior performance with 84.31% accuracy, 90.78% AUC for classification, and 57.24% mean Dice score for localization. The proposed method achieves an F1-score of 88.61% for binary classification while maintaining computational efficiency suitable for real-world deployment through FastAPI implementation. Our comprehensive evaluation includes ablation studies, cross-device generalization analysis, and detailed performance assessment across 10 languages and 3 acquisition devices.
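The mean Dice score reported for localization measures the overlap between predicted and ground-truth manipulation masks. The following is a minimal, dependency-free sketch of that metric, not the authors' evaluation code. It assumes binary masks given as nested lists of 0/1 and treats two empty masks as a perfect match, a convention the summary does not specify. The names `dice_score` and `mean_dice` are illustrative.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks (nested lists of 0/1)."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    intersection = sum(p * t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)
    if total == 0:
        # Assumption: both masks empty counts as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


def mean_dice(preds, targets):
    """Average the per-image Dice coefficient over a set of mask pairs."""
    scores = [dice_score(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


# Toy 2x2 masks: one overlapping pixel out of three marked pixels in total.
pred_mask = [[1, 1], [0, 0]]
true_mask = [[1, 0], [0, 0]]
empty_mask = [[0, 0], [0, 0]]

single = dice_score(pred_mask, true_mask)
average = mean_dice([pred_mask, empty_mask], [true_mask, empty_mask])
```

Because Dice is computed per image and then averaged, a single poorly localized document lowers the mean even when most masks overlap well.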
Key Contributions
- Unified dual-head architecture combining Swin Transformer backbone with Feature Pyramid Network for simultaneous binary classification and pixel-level localization of synthetic manipulations in ID documents
- CBAM-enhanced decoder with uncertainty-weighted multi-task learning for joint optimization of detection and segmentation objectives
- Cross-device and cross-language generalization evaluation on the FantasyIDiap dataset with FastAPI deployment for real-world inference
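The uncertainty-weighted multi-task learning named in the contributions can be illustrated with a short sketch. The summary does not give the exact formulation, so this assumes a commonly used simplified form of homoscedastic uncertainty weighting, in which each task loss is scaled by exp(-s) and s is added as a regularizer, where s is a learnable log-variance per task. The function name and the example loss values are hypothetical.

```python
import math


def uncertainty_weighted_loss(cls_loss, seg_loss, log_var_cls, log_var_seg):
    """Combine two task losses using per-task log-variances (assumed formulation).

    total = exp(-s_cls) * L_cls + s_cls + exp(-s_seg) * L_seg + s_seg
    A larger log-variance down-weights its task, while the additive term
    keeps the log-variances from growing without bound.
    """
    weight_cls = math.exp(-log_var_cls)
    weight_seg = math.exp(-log_var_seg)
    return (weight_cls * cls_loss + log_var_cls
            + weight_seg * seg_loss + log_var_seg)


# With both log-variances at zero the result is the plain sum of the losses.
equal_weighting = uncertainty_weighted_loss(0.7, 0.4, 0.0, 0.0)

# Raising the segmentation log-variance to log(2) halves that task's weight.
seg_downweighted = uncertainty_weighted_loss(0.7, 0.4, 0.0, math.log(2.0))
```

In a training setup the two log-variances would be learnable parameters optimized together with the network weights, so the balance between the classification and segmentation heads is learned rather than hand-tuned.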
🛡️ Threat Analysis
The paper's primary contribution is detecting AI-generated synthetic content (GAN/diffusion-based face swaps and text inpainting) in identity documents. It proposes a novel detection architecture — qualifying as an AI-generated content detection method under output integrity, not merely a domain fine-tune of an existing detector.