Robust Deepfake Detection: Mitigating Spatial Attention Drift via Calibrated Complementary Ensembles
Minh-Khoa Le-Phan, Minh-Hoang Le, Trong-Le Do, Minh-Triet Tran
Published on arXiv
2604.25889
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Achieved fourth place in the NTIRE 2026 Robust Deepfake Detection Challenge at CVPR, with stable zero-shot generalization under compound degradations
Novel technique introduced
Calibrated Complementary Ensembles
Current deepfake detection models achieve state-of-the-art performance on pristine academic datasets but suffer severe spatial attention drift under real-world compound degradations, such as blurring and severe lossy compression. To address this vulnerability, we propose a foundation-driven forensic framework that integrates an extreme compound degradation engine with a structurally constrained, multi-stream architecture. During training, our degradation pipeline systematically destroys high-frequency artifacts, optimizing the DINOv2-Giant backbone to extract invariant geometric and semantic priors. We then process images through three specialized pathways: a Global Texture stream, a Localized Facial stream, and a Hybrid Semantic Fusion stream incorporating CLIP. By analyzing spatial attribution with Score-CAM and feature stability with cosine similarity, we quantitatively demonstrate that these streams extract non-redundant, complementary feature representations and stabilize attention entropy. By aggregating the stream predictions via a calibrated, discretized voting mechanism, our ensemble suppresses background attention drift while acting as a robust geometric anchor. Our approach yields highly stable zero-shot generalization, achieving fourth place in the NTIRE 2026 Robust Deepfake Detection Challenge at CVPR. Code is available at https://github.com/khoalephanminh/ntire26-deepfake-challenge.
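As an illustration of what a calibrated, discretized vote over the three streams could look like, here is a minimal sketch. It is not taken from the paper or its repository: the function names, the per-stream thresholds, the example scores, and the majority rule are all assumptions chosen for clarity.

```python
from typing import Sequence


def discretize(prob: float, threshold: float) -> int:
    """Turn a fake-probability into a hard vote: 1 = fake, 0 = real."""
    return 1 if prob >= threshold else 0


def ensemble_vote(probs: Sequence[float], thresholds: Sequence[float]) -> float:
    """Return the fraction of streams that vote 'fake'.

    Each stream has its own threshold (assumed here to be tuned on a
    held-out validation set), so a systematically over- or
    under-confident stream does not dominate the decision.
    """
    if len(probs) != len(thresholds):
        raise ValueError("probs and thresholds must have the same length")
    if not probs:
        raise ValueError("at least one stream is required")
    votes = [discretize(p, t) for p, t in zip(probs, thresholds)]
    return sum(votes) / len(votes)


# Hypothetical per-stream scores and thresholds (illustrative values only).
stream_probs = {
    "global_texture": 0.62,
    "localized_facial": 0.41,
    "hybrid_semantic": 0.80,
}
stream_thresholds = {
    "global_texture": 0.55,
    "localized_facial": 0.50,
    "hybrid_semantic": 0.70,
}

names = list(stream_probs)
score = ensemble_vote(
    [stream_probs[n] for n in names],
    [stream_thresholds[n] for n in names],
)
# Simple majority rule over the hard votes (an assumption of this sketch).
label = "fake" if score > 0.5 else "real"
```

With these illustrative numbers, two of the three streams vote "fake", so the majority rule labels the image as fake. Voting on discretized outputs rather than averaging raw probabilities is one way to keep a single drifting stream from swamping the others, although the paper's actual calibration procedure may differ.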
Key Contributions
- Foundation-driven framework combining a 14-dataset training pool with an 18-operation compound degradation pipeline to extract robust facial geometry
- Multi-stream architecture (Localized Facial, Global Texture, Hybrid Semantic Fusion via CLIP) that mitigates spatial attention drift under real-world noise
- Calibrated ensemble voting mechanism, with Score-CAM and cosine similarity analysis confirming that the streams extract complementary forensic features
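The cosine similarity analysis mentioned above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the toy vectors are invented, and it assumes the stream embeddings have already been projected to a common dimensionality. High similarity between clean and degraded embeddings of the same stream would suggest stable features, while low similarity between different streams would suggest non-redundant ones.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy embeddings (hypothetical values, for illustration only).
facial_clean = [1.0, 0.0, 2.0]      # one stream, pristine input
facial_degraded = [0.9, 0.1, 2.1]   # same stream, degraded input
texture_clean = [0.0, 3.0, 0.0]     # a different stream, pristine input

# Stability: the same stream should change little under degradation.
stability = cosine_similarity(facial_clean, facial_degraded)

# Complementarity: different streams should not encode the same direction.
redundancy = cosine_similarity(facial_clean, texture_clean)
```

In this toy setup `stability` is close to 1 and `redundancy` is 0, which is the pattern one would hope to see if the streams are both robust and complementary.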
🛡️ Threat Analysis
Deepfake detection is a form of AI-generated content detection: it verifies the authenticity of face images and identifies synthetic or manipulated content. This falls under output integrity and content provenance, the core focus of ML09.