
Decoder Gradient Shields: A Family of Provable and High-Fidelity Methods Against Gradient-Based Box-Free Watermark Removal

Haonan An 1, Guang Hua 2, Wei Du 3, Hangcheng Cao 1, Yihang Tao 1, Guowen Xu 4, Susanto Rahardja 5, Yuguang Fang 1

1 citation · 45 references · TDSC


Published on arXiv

2601.11952

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

All three DGS variants achieve 100% defense success rate against gradient-based watermark removal attacks across all tested settings on deraining and text-to-image generation tasks.

Decoder Gradient Shields (DGS)

Novel technique introduced


Box-free model watermarking has gained significant attention in deep neural network (DNN) intellectual property protection due to its model-agnostic nature and its ability to flexibly manage high-entropy image outputs from generative models. Typically operating in a black-box manner, it employs an encoder-decoder framework for watermark embedding and extraction. While existing research has focused primarily on encoder robustness against various attacks, the decoders have been largely overlooked, leaving the watermark exposed to decoder-side attacks. In this paper, we identify one such attack against the decoder, in which query responses are used to obtain backpropagated gradients for training a watermark remover. To address this issue, we propose Decoder Gradient Shields (DGSs), a family of defense mechanisms applied at the output (DGS-O), at the input (DGS-I), and in the layers (DGS-L) of the decoder, with a closed-form solution for DGS-O and provable performance for all DGS variants. By jointly reorienting and rescaling the gradients returned to watermark-channel gradient-leaking queries, the proposed DGSs prevent the watermark remover from converging to the desired low-loss value while preserving the image quality of the decoder output. We demonstrate the effectiveness of the proposed DGSs in diverse application scenarios. Our experimental results on deraining and image generation tasks with state-of-the-art box-free watermarking show that our DGSs achieve a defense success rate of 100% under all settings.
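The reorient-and-rescale idea can be illustrated with a minimal sketch. The helper below is hypothetical and is not the paper's closed-form DGS-O: given a gradient about to be returned to a querying attacker, it rotates the gradient to a fixed angle away from its true direction (so descent on the shielded gradient no longer descends on the true loss) and inflates its magnitude.

```python
import math

def dgs_shield(grad, angle=3 * math.pi / 4, scale=5.0):
    """Toy gradient shield (illustrative only, not the paper's DGS-O):
    reorient a leaked gradient to a fixed angle from its true direction
    and rescale its magnitude before returning it to the querier."""
    norm = math.sqrt(sum(g * g for g in grad))
    u = [g / norm for g in grad]                      # true direction
    # Build a unit vector r orthogonal to u from a fixed probe vector.
    probe = [0.0] * len(grad)
    probe[0 if abs(u[0]) < 0.9 else 1] = 1.0
    d = sum(p * ui for p, ui in zip(probe, u))
    r = [p - d * ui for p, ui in zip(probe, u)]
    rn = math.sqrt(sum(ri * ri for ri in r))
    r = [ri / rn for ri in r]
    # Shielded gradient: cos(angle) along u, sin(angle) along r,
    # with a misleading magnitude of scale * ||grad||.
    return [scale * norm * (math.cos(angle) * ui + math.sin(angle) * ri)
            for ui, ri in zip(u, r)]

g_s = dgs_shield([1.0, 2.0, 3.0])
# cosine similarity between g_s and the true gradient is cos(135°) < 0,
# so a gradient-descent step on g_s increases the attacker's true loss
```

With `angle > 90°` the cosine similarity to the true gradient is negative, which is the geometric reason the remover's training cannot reach the desired low-loss value.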


Key Contributions

  • Identifies a gradient-leakage attack on box-free watermarking decoders where query responses enable training an effective watermark remover
  • Proposes Decoder Gradient Shields (DGS-O, DGS-I, DGS-L) — three provable defense variants that reorient and rescale decoder gradients to prevent watermark remover convergence
  • DGS-O has a closed-form solution and all three variants achieve 100% defense success rate on deraining and image generation tasks while preserving output image quality

🛡️ Threat Analysis

Output Integrity Attack

Box-free model watermarking embeds marks in model-generated image outputs (not model weights), making it content watermarking per the mechanism-based decision tree. The identified attack trains a watermark remover by leveraging backpropagated gradients from decoder queries — a watermark removal attack on content integrity. The proposed DGS family defends against this removal attack by reorienting/rescaling gradients to prevent the remover from converging.
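The attack loop and the shield's effect can be sketched numerically in a deliberately simplified setting (a linear "decoder" and a one-parameter "remover", not the paper's models): the attacker queries the decoder, backpropagates the extraction loss to its remover parameter, and descends. With honest gradients the loss collapses; when the response is sign-flipped and rescaled (the 1-D analogue of reorienting and rescaling), it never converges.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy setup: a linear "decoder" w.x extracts the watermark
# from an output image x; the attacker's "remover" is a single gain a.
w = [0.5, -0.5, 0.5, -0.5]   # secret watermark extractor
x = [1.0, -1.0, 1.0, -1.0]   # watermarked model output

def decoder(img):
    return dot(w, img)       # extracted watermark strength

def attack(shield=None, steps=100, lr=0.05):
    a = 1.0                                  # remover: img -> a * img
    for _ in range(steps):
        y = decoder([a * v for v in x])      # query response
        grad = 2.0 * y * dot(w, x)           # d(y^2)/da, leaked per query
        if shield is not None:
            grad = shield(grad)              # DGS-style reorient + rescale
        a -= lr * grad
    return decoder([a * v for v in x]) ** 2  # attacker's final loss

honest = attack()                              # loss collapses toward 0
shielded = attack(shield=lambda g: -3.0 * g)   # shield: flip + rescale
```

In one dimension, "reorienting" degenerates to a sign flip; in the paper's setting the shield operates on full gradient tensors while the decoder's forward output (and hence image quality) is left untouched.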


Details

Domains
vision, generative
Model Types
gan, diffusion, cnn
Threat Tags
black_box, inference_time
Datasets
deraining task datasets, image generation benchmarks
Applications
image generation, dnn intellectual property protection, box-free model watermarking