attack · arXiv · Jan 13, 2026
Fahad Shamshad, Nils Lukas, Karthik Nandakumar · MBZUAI · Michigan State University
Attacks invisible image watermarks by reformulating removal as novel view synthesis using zero-shot diffusion, defeating 15 schemes without detector access.
Output Integrity Attack · vision · generative
Invisible watermarking has become a critical mechanism for authenticating AI-generated image content, with major platforms deploying watermarking schemes at scale. However, evaluating the vulnerability of these schemes against sophisticated removal attacks remains essential to assess their reliability and guide robust design. In this work, we expose a fundamental vulnerability in invisible watermarks by reformulating watermark removal as a view synthesis problem. Our key insight is that generating a perceptually consistent alternative view of the same semantic content, akin to re-observing a scene from a shifted perspective, naturally removes the embedded watermark while preserving visual fidelity. This reveals a critical gap: watermarks robust to pixel-space and frequency-domain attacks remain vulnerable to semantic-preserving viewpoint transformations. We introduce a zero-shot diffusion-based framework that applies controlled geometric transformations in latent space, augmented with view-guided correspondence attention to maintain structural consistency during reconstruction. Operating on frozen pre-trained models without detector access or watermark knowledge, our method achieves state-of-the-art watermark suppression across 15 watermarking methods, outperforming 14 baseline attacks while maintaining superior perceptual quality across multiple datasets.
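The abstract does not specify how the "controlled geometric transformations in latent space" are implemented, so the following is only a minimal sketch of the general idea: resample a latent feature map under a small in-plane rotation, as one might do to simulate a shifted viewpoint before diffusion-based reconstruction. The function name `warp_latent`, the rotation-only transform, and the plain NumPy bilinear resampling are all assumptions for illustration; in the actual pipeline the latent would come from a frozen pre-trained encoder and the reconstruction from the diffusion model with correspondence attention.

```python
import numpy as np

def warp_latent(latent, angle_deg=5.0):
    """Resample a latent feature map (C, H, W) under a small in-plane
    rotation about its center, using inverse mapping with bilinear
    sampling. A toy stand-in for a latent-space view shift."""
    C, H, W = latent.shape
    theta = np.deg2rad(angle_deg)
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Inverse warp: for each output location, find where it came from.
    x0, y0 = xs - cx, ys - cy
    src_x = np.cos(theta) * x0 + np.sin(theta) * y0 + cx
    src_y = -np.sin(theta) * x0 + np.cos(theta) * y0 + cy
    # Integer corners and fractional offsets, clamped to the grid.
    x_lo = np.clip(np.floor(src_x).astype(int), 0, W - 2)
    y_lo = np.clip(np.floor(src_y).astype(int), 0, H - 2)
    wx = np.clip(src_x - x_lo, 0.0, 1.0)
    wy = np.clip(src_y - y_lo, 0.0, 1.0)
    out = np.empty_like(latent)
    for c in range(C):
        ch = latent[c]
        top = ch[y_lo, x_lo] * (1 - wx) + ch[y_lo, x_lo + 1] * wx
        bot = ch[y_lo + 1, x_lo] * (1 - wx) + ch[y_lo + 1, x_lo + 1] * wx
        out[c] = top * (1 - wy) + bot * wy
    return out
```

The intuition the abstract points at is that such a resampling perturbs the pixel-level and frequency-domain statistics a watermark detector relies on, while the semantic content survives and can be re-rendered consistently by the diffusion model.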
benchmark · arXiv · Nov 24, 2025
Mohammed Talha Alam, Nada Saadi, Fahad Shamshad et al. · Mohamed bin Zayed University of Artificial Intelligence · Michigan State University · University of Waterloo
Benchmarks T2I diffusion safety alignment across safety, utility, quality, and robustness after benign LoRA fine-tuning
Output Integrity Attack · Transfer Learning Attack · vision · generative
Text-to-image diffusion models can emit copyrighted, unsafe, or private content. Safety alignment aims to suppress specific concepts, yet evaluations seldom test whether safety persists under benign downstream fine-tuning routinely applied after deployment (e.g., LoRA personalization, style/domain adapters). We study the stability of current safety methods under benign fine-tuning and observe frequent breakdowns. As true safety alignment must withstand even benign post-deployment adaptations, we introduce the SPQR benchmark (Safety-Prompt adherence-Quality-Robustness). SPQR is a single-score metric that provides a standardized and reproducible framework for evaluating how well safety-aligned diffusion models preserve safety, utility, and robustness under benign fine-tuning, reporting one leaderboard score to facilitate comparisons. We conduct multilingual, domain-specific, and out-of-distribution analyses, along with category-wise breakdowns, to identify when safety alignment fails after benign fine-tuning, ultimately showcasing SPQR as a concise yet comprehensive benchmark for T2I safety alignment techniques.
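The abstract says SPQR collapses the four axes into one leaderboard score but does not state the aggregation rule. A plausible toy sketch, assuming per-axis scores normalized to [0, 1] and a harmonic-mean combination (so a collapse on any single axis, such as safety breaking after fine-tuning, drags the total down sharply); the function name `spqr_score` and the aggregation choice are this sketch's assumptions, not the paper's definition.

```python
def spqr_score(safety, prompt_adherence, quality, robustness, eps=1e-8):
    """Toy single-score aggregation over the four SPQR axes.

    Each axis score is expected in [0, 1]. The harmonic mean rewards
    balanced performance: a model cannot buy back a safety failure
    with high image quality. `eps` guards against division by zero.
    """
    axes = (safety, prompt_adherence, quality, robustness)
    if any(not 0.0 <= a <= 1.0 for a in axes):
        raise ValueError("axis scores must lie in [0, 1]")
    return len(axes) / sum(1.0 / max(a, eps) for a in axes)
```

For example, a model scoring 0.9 on adherence, quality, and robustness but only 0.1 on safety after benign LoRA fine-tuning lands near 0.3 overall, whereas a uniform 0.5 across all axes scores 0.5, which is the ranking behavior a safety-first leaderboard would want.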