defense · arXiv · Dec 17, 2025
Sarim Hashmi, Abdelrahman Elsayed, Mohammed Talha Alam et al. · Mohamed bin Zayed University of Artificial Intelligence
Resynthesis-based deepfake detector with calibrated low false-positive rates and robustness against adaptive evasion adversaries across modalities
Output Integrity Attack · vision · multimodal · generative
Generative models can synthesize highly realistic content, so-called deepfakes, that are already being misused at scale to undermine digital media authenticity. Current deepfake detection methods are unreliable for two reasons: (i) distinguishing inauthentic content post-hoc is often impossible (e.g., with memorized samples), leading to an unbounded false positive rate (FPR); and (ii) detection lacks robustness, as adversaries can adapt to known detectors with near-perfect accuracy using minimal computational resources. To address these limitations, we propose a resynthesis framework to determine if a sample is authentic or if its authenticity can be plausibly denied. We make two key contributions focusing on the high-precision, low-recall setting against efficient (i.e., compute-restricted) adversaries. First, we demonstrate that our calibrated resynthesis method is the most reliable approach for verifying authentic samples while maintaining controllable, low FPRs. Second, we show that our method achieves adversarial robustness against efficient adversaries, whereas prior methods are easily evaded under identical compute budgets. Our approach supports multiple modalities and leverages state-of-the-art inversion techniques.
diffusion · multimodal · Mohamed bin Zayed University of Artificial Intelligence
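The abstract's claim of "controllable, low FPRs" rests on calibrating a decision threshold on held-out authentic samples. The paper's exact calibration procedure is not given here; the sketch below is an illustrative, conformal-style quantile calibration under the assumption that the detector emits a scalar score (e.g., a resynthesis distance) where higher means "more likely synthetic". The function name `calibrate_threshold` and the score distribution are stand-ins, not the authors' implementation.

```python
import numpy as np

def calibrate_threshold(authentic_scores, target_fpr=0.01):
    """Pick a threshold so that at most `target_fpr` of held-out
    authentic samples score above it (i.e., get flagged as fake).
    Uses a conservative (1 - target_fpr) empirical quantile."""
    scores = np.sort(np.asarray(authentic_scores))
    n = len(scores)
    # Conformal-style rank: ceil((1 - alpha) * (n + 1)) - 1, clipped.
    k = int(np.ceil((1 - target_fpr) * (n + 1))) - 1
    k = min(max(k, 0), n - 1)
    return scores[k]

# Toy example: stand-in calibration scores for authentic samples.
rng = np.random.default_rng(0)
authentic = rng.normal(0.0, 1.0, 10_000)
tau = calibrate_threshold(authentic, target_fpr=0.01)
fpr = (authentic > tau).mean()  # empirical FPR on the calibration set
```

The conservative quantile keeps the empirical FPR at or below the target on the calibration data, which matches the paper's high-precision, low-recall framing: authentic samples are rarely flagged, at the cost of missing some synthetic ones.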
benchmark · arXiv · Nov 24, 2025
Mohammed Talha Alam, Nada Saadi, Fahad Shamshad et al. · Mohamed bin Zayed University of Artificial Intelligence · Michigan State University +1 more
Benchmarks T2I diffusion safety alignment across safety, utility, quality, and robustness after benign LoRA fine-tuning
Output Integrity Attack · Transfer Learning Attack · vision · generative
Text-to-image diffusion models can emit copyrighted, unsafe, or private content. Safety alignment aims to suppress specific concepts, yet evaluations seldom test whether safety persists under the benign downstream fine-tuning routinely applied after deployment (e.g., LoRA personalization, style/domain adapters). We study the stability of current safety methods under benign fine-tuning and observe frequent breakdowns. Since true safety alignment must withstand even benign post-deployment adaptation, we introduce the SPQR benchmark (Safety, Prompt adherence, Quality, Robustness): a standardized, reproducible framework that evaluates how well safety-aligned diffusion models preserve safety, utility, and robustness under benign fine-tuning, and reports a single leaderboard score to facilitate comparisons. We conduct multilingual, domain-specific, and out-of-distribution analyses, along with category-wise breakdowns, to identify when safety alignment fails after benign fine-tuning, ultimately showcasing SPQR as a concise yet comprehensive benchmark for T2I safety alignment techniques.
diffusion · Mohamed bin Zayed University of Artificial Intelligence · Michigan State University · University of Waterloo
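SPQR collapses four axes into one leaderboard number, but the abstract does not specify the aggregation formula. The sketch below is a hypothetical aggregation, not the paper's: a harmonic mean is chosen so that a collapse on any single axis (e.g., safety breaking after benign LoRA fine-tuning) drags the overall score down sharply, which matches the benchmark's stated intent.

```python
def spqr_score(safety, prompt_adherence, quality, robustness):
    """Hypothetical single-score aggregation of the four SPQR axes,
    each assumed to lie in [0, 1]. Harmonic mean: weak on any axis
    means weak overall. The paper's actual formula may differ."""
    axes = [safety, prompt_adherence, quality, robustness]
    if any(a <= 0 for a in axes):
        return 0.0  # a fully failed axis zeroes the score
    return len(axes) / sum(1.0 / a for a in axes)

# A model whose safety collapses after fine-tuning scores poorly
# even if prompt adherence and image quality stay high:
balanced = spqr_score(0.9, 0.9, 0.9, 0.9)  # -> 0.9
broken = spqr_score(0.1, 0.9, 0.9, 0.9)    # -> ~0.30
```

An arithmetic mean would rate the "broken" model at 0.70, masking the safety failure; the harmonic mean makes the single score sensitive to exactly the breakdowns the benchmark is designed to surface.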