
Decoder Gradient Shields: A Family of Provable and High-Fidelity Methods Against Gradient-Based Box-Free Watermark Removal

Haonan An 1, Guang Hua 2, Wei Du 3, Hangcheng Cao 1, Yihang Tao 1, Guowen Xu 4, Susanto Rahardja 5, Yuguang Fang 1

1 citation · 45 references · TDSC


Published on arXiv

2601.11952

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

All three DGS variants achieve 100% defense success rate against gradient-based watermark removal attacks across all tested settings on deraining and text-to-image generation tasks.

Decoder Gradient Shields (DGS)

Novel technique introduced


Box-free model watermarking has gained significant attention in deep neural network (DNN) intellectual property protection due to its model-agnostic nature and its ability to flexibly manage high-entropy image outputs from generative models. Typically operating in a black-box manner, it employs an encoder-decoder framework for watermark embedding and extraction. While existing research has focused primarily on encoder robustness against various attacks, the decoders have been largely overlooked, leaving the watermark exposed to decoder-side attacks. In this paper, we identify one such attack against the decoder, in which query responses are used to obtain backpropagated gradients for training a watermark remover. To address this issue, we propose Decoder Gradient Shields (DGSs), a family of defense mechanisms applied at the output (DGS-O), at the input (DGS-I), and in the layers (DGS-L) of the decoder, with a closed-form solution for DGS-O and provable performance for all DGS variants. By jointly reorienting and rescaling the gradients returned to watermark-channel gradient-leaking queries, the proposed DGSs prevent the watermark remover from converging to the desired low-loss value while preserving the image quality of the decoder output. We demonstrate the effectiveness of the proposed DGSs in diverse application scenarios. Our experimental results on deraining and image generation tasks with state-of-the-art box-free watermarking show that our DGSs achieve a defense success rate of 100% under all settings.
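The reorient-and-rescale idea can be illustrated with a minimal sketch. The helper below is hypothetical and is not the paper's closed-form DGS-O: given a gradient about to be returned to a querying attacker, it rotates the gradient to a fixed angle away from its true direction (so descent on the shielded gradient no longer descends on the true loss) and inflates its magnitude.

```python
import math

def dgs_shield(grad, angle=3 * math.pi / 4, scale=5.0):
    """Toy gradient shield (illustrative only, not the paper's DGS-O):
    reorient a leaked gradient to a fixed angle from its true direction
    and rescale its magnitude before returning it to the querier."""
    norm = math.sqrt(sum(g * g for g in grad))
    u = [g / norm for g in grad]                      # true direction
    # Build a unit vector r orthogonal to u from a fixed probe vector.
    probe = [0.0] * len(grad)
    probe[0 if abs(u[0]) < 0.9 else 1] = 1.0
    d = sum(p * ui for p, ui in zip(probe, u))
    r = [p - d * ui for p, ui in zip(probe, u)]
    rn = math.sqrt(sum(ri * ri for ri in r))
    r = [ri / rn for ri in r]
    # Shielded gradient: cos(angle) along u, sin(angle) along r,
    # with a misleading magnitude of scale * ||grad||.
    return [scale * norm * (math.cos(angle) * ui + math.sin(angle) * ri)
            for ui, ri in zip(u, r)]

g_s = dgs_shield([1.0, 2.0, 3.0])
# cosine similarity between g_s and the true gradient is cos(135°) < 0,
# so a gradient-descent step on g_s increases the attacker's true loss
```

With `angle > 90°` the cosine similarity to the true gradient is negative, which is the geometric reason the remover's training cannot reach the desired low-loss value.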


Key Contributions

  • Identifies a gradient-leakage attack on box-free watermarking decoders where query responses enable training an effective watermark remover
  • Proposes Decoder Gradient Shields (DGS-O, DGS-I, DGS-L) — three provable defense variants that reorient and rescale decoder gradients to prevent watermark remover convergence
  • DGS-O has a closed-form solution and all three variants achieve 100% defense success rate on deraining and image generation tasks while preserving output image quality

🛡️ Threat Analysis

Output Integrity Attack

Box-free model watermarking embeds marks in model-generated image outputs (not model weights), making it content watermarking per the mechanism-based decision tree. The identified attack trains a watermark remover by leveraging backpropagated gradients from decoder queries — a watermark removal attack on content integrity. The proposed DGS family defends against this removal attack by reorienting/rescaling gradients to prevent the remover from converging.
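The attack loop and the shield's effect can be sketched numerically in a deliberately simplified setting (a linear "decoder" and a one-parameter "remover", not the paper's models): the attacker queries the decoder, backpropagates the extraction loss to its remover parameter, and descends. With honest gradients the loss collapses; when the response is sign-flipped and rescaled (the 1-D analogue of reorienting and rescaling), it never converges.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy setup: a linear "decoder" w.x extracts the watermark
# from an output image x; the attacker's "remover" is a single gain a.
w = [0.5, -0.5, 0.5, -0.5]   # secret watermark extractor
x = [1.0, -1.0, 1.0, -1.0]   # watermarked model output

def decoder(img):
    return dot(w, img)       # extracted watermark strength

def attack(shield=None, steps=100, lr=0.05):
    a = 1.0                                  # remover: img -> a * img
    for _ in range(steps):
        y = decoder([a * v for v in x])      # query response
        grad = 2.0 * y * dot(w, x)           # d(y^2)/da, leaked per query
        if shield is not None:
            grad = shield(grad)              # DGS-style reorient + rescale
        a -= lr * grad
    return decoder([a * v for v in x]) ** 2  # attacker's final loss

honest = attack()                              # loss collapses toward 0
shielded = attack(shield=lambda g: -3.0 * g)   # shield: flip + rescale
```

In one dimension, "reorienting" degenerates to a sign flip; in the paper's setting the shield operates on full gradient tensors while the decoder's forward output (and hence image quality) is left untouched.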


Details

Domains
vision, generative
Model Types
gan, diffusion, cnn
Threat Tags
black_box, inference_time
Datasets
deraining task datasets, image generation benchmarks
Applications
image generation, dnn intellectual property protection, box-free model watermarking