
Breaking Semantic-Aware Watermarks via LLM-Guided Coherence-Preserving Semantic Injection

Zheng Gao 1, Xiaoyu Li 1, Zhicheng Bao 1, Xiaoyan Feng 2, Jiaojiao Jiang 1

Published on arXiv (Cornell University) · arXiv:2602.21593

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

CSI consistently outperforms prevailing attack baselines against content-aware semantic watermarking schemes, exposing a fundamental security weakness when confronted with LLM-driven semantic perturbations.

CSI (Coherence-Preserving Semantic Injection)

Novel technique introduced


AI-generated images have proliferated across Web platforms such as social media and online copyright-distribution services, and semantic watermarking has increasingly been integrated into diffusion models to support reliable provenance tracking and forgery prevention for web content. Traditional noise-layer-based watermarking, however, remains vulnerable to inversion attacks that can recover embedded signals. To mitigate this, recent content-aware semantic watermarking schemes bind watermark signals to high-level image semantics, constraining local edits that would otherwise disrupt global coherence. Yet large language models (LLMs) possess structured reasoning capabilities that enable targeted exploration of semantic spaces, allowing locally fine-grained but globally coherent semantic alterations that invalidate such bindings. To expose this overlooked vulnerability, we introduce a Coherence-Preserving Semantic Injection (CSI) attack that leverages LLM-guided semantic manipulation under embedding-space similarity constraints. The similarity constraint enforces visual-semantic consistency while selectively perturbing watermark-relevant semantics, ultimately inducing detector misclassification. Extensive empirical results show that CSI consistently outperforms prevailing attack baselines against content-aware semantic watermarking, revealing a fundamental security weakness of current semantic watermark designs when confronted with LLM-driven semantic perturbations.


Key Contributions

  • Identifies a fundamental vulnerability in content-aware semantic watermarking: LLMs can explore semantic spaces to find locally fine-grained yet globally coherent alterations that invalidate watermark-semantic bindings
  • Proposes CSI (Coherence-Preserving Semantic Injection), an LLM-guided attack using embedding-space similarity constraints to maintain visual-semantic consistency while disrupting watermark detection
  • Empirically demonstrates that CSI consistently outperforms existing attack baselines against content-aware semantic watermarking schemes in diffusion models

🛡️ Threat Analysis

Output Integrity Attack

The paper directly attacks content watermarks embedded in AI-generated image outputs for provenance tracking — the watermarks reside in model outputs (not model weights), making this a watermark removal/evasion attack under Output Integrity. The CSI attack causes detector misclassification, undermining the content authenticity and provenance tracking goals of semantic watermarking schemes.


Details

Domains
vision · generative · nlp
Model Types
diffusion · llm
Threat Tags
black_box · inference_time · targeted · digital
Applications
ai-generated image provenance tracking · content watermarking · digital rights management · deepfake/forgery prevention