Decoupling Defense Strategies for Robust Image Watermarking
Jiahui Chen 1, Zehang Deng 2, Zeyu Zhang 3, Chaoyang Li 1, Lianchen Jia 1, Lifeng Sun 1
Published on arXiv
arXiv:2602.20053
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
AdvMark achieves up to 29%, 33%, and 46% bit-accuracy improvement over prior defenses for distortion, regeneration, and adversarial attacks respectively, while maintaining the highest image quality.
AdvMark
Novel technique introduced
Deep learning-based image watermarking, while robust against conventional distortions, remains vulnerable to advanced adversarial and regeneration attacks. Conventional countermeasures, which jointly optimize the encoder and decoder through a noise layer, face two inherent challenges: (1) a drop in clean accuracy caused by adversarial training of the decoder, and (2) limited robustness when training against all three advanced attack types simultaneously. To overcome these issues, we propose AdvMark, a novel two-stage fine-tuning framework that decouples the defense strategies. In stage 1, we address adversarial vulnerability with a tailored adversarial training paradigm that primarily fine-tunes the encoder and only conditionally updates the decoder. This approach learns to move the image into a non-attackable region, rather than modifying the decision boundary, and thus preserves clean accuracy. In stage 2, we tackle distortion and regeneration attacks via direct image optimization. To preserve the adversarial robustness gained in stage 1, we formulate a principled, constrained image loss with theoretical guarantees that balances deviation from the cover image against deviation from the previously encoded image. We also propose a quality-aware early stop that further guarantees a lower bound on visual quality. Extensive experiments demonstrate that AdvMark achieves the highest image quality and the most comprehensive robustness, with up to 29%, 33%, and 46% accuracy improvements against distortion, regeneration, and adversarial attacks, respectively.
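The stage-1 idea of "moving the image into a non-attackable region" can be illustrated with a toy example. The sketch below is my own minimal model, not the paper's architecture: the decoder is a fixed linear classifier reading a watermark bit as `sign(w @ x)`, the adversary is a worst-case L∞ perturbation with budget `eps`, and encoder-side fine-tuning is gradient ascent on the decoder margin while the decoder weights stay frozen, mirroring the principle that the decision boundary is left untouched.

```python
import numpy as np

# Toy 1-D illustration (an assumption, not the paper's model): the decoder
# reads a watermark bit as sign(w @ x). An L_inf adversary with budget eps
# can flip the bit whenever the margin |w @ x| is below eps * ||w||_1.
# AdvMark's stage 1 primarily moves the *encoded image* past that margin
# ("non-attackable region") instead of retraining the decoder.

rng = np.random.default_rng(0)
w = rng.normal(size=8)            # frozen decoder weights (decision boundary)
x = rng.normal(size=8) * 0.1      # encoded image feature, initially low-margin
eps = 0.5
margin = eps * np.abs(w).sum()    # worst-case logit shift under the attack

def attacked_bit(x):
    # strongest L_inf perturbation against the current prediction
    delta = -eps * np.sign(w) * np.sign(w @ x)
    return np.sign(w @ (x + delta))

bit = np.sign(w @ x)
# encoder-side fine-tuning: gradient ascent on the margin, decoder untouched
for _ in range(200):
    x += 0.05 * bit * w           # d(bit * (w @ x))/dx = bit * w
    if bit * (w @ x) > margin:    # image now sits outside the attackable region
        break

assert attacked_bit(x) == bit     # the attack can no longer flip the bit
```

The key property this demonstrates: robustness is gained by relocating the encoded sample, so the decoder's behavior on clean inputs (its decision boundary `w`) is unchanged, which is why clean accuracy is preserved.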
Key Contributions
- Two-stage decoupled fine-tuning framework (AdvMark) that separately addresses adversarial vulnerability (stage 1) and distortion/regeneration attacks (stage 2), avoiding the clean-accuracy degradation inherent to joint training approaches.
- Adversarial training paradigm that primarily fine-tunes the encoder to move images into a non-attackable region, preserving clean decoder accuracy while gaining adversarial robustness.
- Constrained image optimization with theoretical guarantees and a quality-aware early-stop mechanism for stage 2, balancing robustness against regeneration/distortion attacks with visual quality.
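The stage-2 constrained optimization with a quality-aware early stop can be sketched as follows. The exact loss and guarantees from the paper are not reproduced here; this sketch assumes a simple weighted form that balances deviation from the cover image `x_cov` (visual quality) and from the stage-1 encoded image `x_enc` (retaining stage-1 adversarial robustness), with a PSNR floor standing in for the quality-aware early stop. The robustness gradient `grad_robust` is a hypothetical stand-in for the paper's model-based term.

```python
import numpy as np

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB between two images in [0, peak]."""
    mse = np.mean((a - b) ** 2)
    return 10 * np.log10(peak ** 2 / mse) if mse > 0 else np.inf

def stage2_optimize(x_enc, x_cov, grad_robust, steps=100, lr=0.01,
                    alpha=1.0, beta=1.0, psnr_floor=40.0):
    """Illustrative direct image optimization (assumed form, not the paper's):
    descend a robustness loss while penalizing deviation from both the cover
    image and the stage-1 encoded image; stop before PSNR crosses the floor."""
    x = x_enc.copy()
    for _ in range(steps):
        g = grad_robust(x) + alpha * (x - x_cov) + beta * (x - x_enc)
        x_next = np.clip(x - lr * g, 0.0, 1.0)
        if psnr(x_next, x_cov) < psnr_floor:   # quality-aware early stop:
            break                              # never accept a step below it
        x = x_next
    return x

# toy usage: the "robustness" gradient pulls pixels toward 0.5 (a made-up
# surrogate for an attack-resistance objective)
rng = np.random.default_rng(1)
x_cov = rng.random((8, 8))
x_enc = np.clip(x_cov + 0.005 * rng.normal(size=(8, 8)), 0.0, 1.0)
x_out = stage2_optimize(x_enc, x_cov, lambda x: x - 0.5)
assert psnr(x_out, x_cov) >= 40.0  # early stop enforces the quality floor
```

Because every accepted step is checked against the PSNR floor before being committed, the returned image can never fall below it, which is the mechanism by which an early stop can guarantee a lower bound on visual quality.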
🛡️ Threat Analysis
The paper defends image content watermarks — marks embedded in image outputs to trace provenance — against attacks that destroy or circumvent them (adversarial, regeneration, distortion). This is output integrity protection for watermarked content, not model-weight watermarking for IP protection.