Transferable Black-Box One-Shot Forging of Watermarks via Image Preference Models
Tomáš Souček 1, Sylvestre-Alvise Rebuffi 1, Pierre Fernandez 1, Nikola Jovanović 2, Hady Elsahar 1, Valeriu Lacatusu 1, Tuan Tran 1, Alexandre Mourachko 1
1 Meta
Published on arXiv (arXiv:2510.20468)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Successfully forges and removes post-hoc image watermarks in a realistic black-box, one-shot setting with no access to the watermarking model or paired data, exposing fundamental security weaknesses in current watermarking schemes.
WMForger
Novel technique introduced
Recent years have seen a surge in interest in digital content watermarking techniques, driven by the proliferation of generative models and increased legal pressure. With an ever-growing share of AI-generated content online, watermarking plays an increasingly important role in ensuring content authenticity and attribution at scale. Many works have assessed the robustness of watermarks to removal attacks, yet watermark forging, the scenario in which a watermark is stolen from genuine content and applied to malicious content, remains underexplored. In this work, we investigate watermark forging in the context of widely used post-hoc image watermarking. Our contributions are as follows. First, we introduce a preference model that assesses whether an image is watermarked. The model is trained with a ranking loss on purely procedurally generated images, without any need for real watermarks. Second, we demonstrate the model's ability to remove and forge watermarks by optimizing the input image through backpropagation. This technique requires only a single watermarked image and works without knowledge of the watermarking model, making our attack much simpler and more practical than attacks introduced in related work. Third, we evaluate the proposed method on a variety of post-hoc image watermarking models, demonstrating that our approach can effectively forge watermarks and calling into question the security of current watermarking approaches. Our code and further resources are publicly available.
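The ranking-loss training idea can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it assumes a margin-based pairwise loss over scores that the preference model assigns to a less-perturbed and a more-perturbed variant of the same procedurally generated image.

```python
import numpy as np

def pairwise_ranking_loss(score_less_perturbed, score_more_perturbed, margin=1.0):
    """Hinge-style ranking loss: push the score of the less-perturbed
    image above that of the more-perturbed one by at least `margin`.
    (Illustrative stand-in for the paper's ranking objective, which is
    trained on procedurally generated pairs rather than real watermarks.)"""
    return np.maximum(0.0, margin - (score_less_perturbed - score_more_perturbed))

# Toy scores for a single training pair
print(pairwise_ranking_loss(3.0, 1.0))  # 0.0: pair ranked correctly beyond the margin
print(pairwise_ranking_loss(1.0, 3.0))  # 3.0: mis-ranked pair incurs a penalty
```

Because only the *ordering* of perturbed variants is supervised, no real watermarked images or paired watermark/clean data are needed, which matches the black-box setting the paper targets.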
Key Contributions
- Image preference model trained via ranking loss on procedurally perturbed images — no real watermarked data or decoding model required
- Gradient-based attack that uses the preference model to forge or remove post-hoc image watermarks from a single watermarked example with no knowledge of the watermarking scheme
- Comprehensive evaluation across multiple post-hoc watermarking methods demonstrating practical vulnerability of current content watermarking approaches
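The gradient-based attack in the second contribution can be sketched in miniature. The trained preference model is replaced here by a differentiable toy score (a hypothetical stand-in, not the actual model); the attack ascends the score's gradient with respect to the input pixels, with an optional L-infinity cap so the forged image stays visually close to the original.

```python
import numpy as np

def forge_by_gradient_ascent(x, score_fn, grad_fn, lr=0.1, steps=50, eps=None):
    """Raise the watermark score of `x` by gradient ascent on the pixels.
    `score_fn`/`grad_fn` stand in for the preference model and autograd;
    `eps` bounds the L-inf perturbation (removal would descend instead)."""
    x0 = x.copy()
    for _ in range(steps):
        x = x + lr * grad_fn(x)
        if eps is not None:
            x = np.clip(x, x0 - eps, x0 + eps)  # keep the edit imperceptible
    return x

# Toy "preference model": score peaks when the image matches pattern w
rng = np.random.default_rng(0)
w = rng.normal(size=16)                 # hypothetical watermark-like pattern
x = np.zeros(16)                        # image to forge the watermark onto
score = lambda z: -np.sum((z - w) ** 2)
grad = lambda z: -2.0 * (z - w)

x_forged = forge_by_gradient_ascent(x, score, grad, eps=0.5)
print(score(x_forged) > score(x))  # True: the forged image scores higher
```

In the paper's setting the gradient would come from backpropagation through the trained preference model rather than an analytic formula; only a single watermarked example is needed to define the target score.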
🛡️ Threat Analysis
The attack targets content watermarks embedded in image outputs (not model weights), both removing them and forging them onto new images — directly attacking output integrity and content provenance systems. Per the watermarking decision tree, attacking content watermarks maps to ML09.