DualGuard: Dual-stream Large Language Model Watermarking Defense against Paraphrase and Spoofing Attack
Hao Li 1,2, Yubing Ren 1,2, Yanan Cao 1,2, Yingjie Li 1,2, Fang Fang 1,2, Shi Wang 1, Li Guo 1,2
Published on arXiv: 2512.16182
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
DualGuard achieves robust detectability against paraphrase attacks while uniquely enabling detection and tracing of spoofing attacks that inject harmful content into watermarked LLM outputs.
Novel technique introduced: DualGuard
With the rapid development of cloud-based services, large language models (LLMs) have become increasingly accessible through various web platforms. However, this accessibility has also led to growing risks of model abuse. LLM watermarking has emerged as an effective approach to mitigate such misuse and protect intellectual property. Existing watermarking algorithms, however, primarily focus on defending against paraphrase attacks while overlooking piggyback spoofing attacks, which can inject harmful content, compromise watermark reliability, and undermine trust in attribution. To address this limitation, we propose DualGuard, the first watermarking algorithm capable of defending against both paraphrase and spoofing attacks. DualGuard employs an adaptive dual-stream watermarking mechanism, in which two complementary watermark signals are dynamically injected based on the semantic content. This design enables DualGuard not only to detect but also to trace spoofing attacks, thereby ensuring reliable and trustworthy watermark detection. Extensive experiments conducted across multiple datasets and language models demonstrate that DualGuard achieves excellent detectability, robustness, traceability, and text quality, effectively advancing the state of LLM watermarking for real-world applications.
Key Contributions
- First watermarking algorithm that simultaneously defends against both paraphrase attacks and piggyback spoofing attacks on LLM-generated text
- Adaptive dual-stream watermarking mechanism that dynamically injects two complementary watermark signals based on semantic content
- Spoofing attack traceability: DualGuard can both detect and trace the injection of harmful content into watermarked text, preventing false attribution
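
This summary does not describe DualGuard's algorithm, so the following is only a toy sketch of the general idea behind the contributions above: two keyed watermark signals, one of which is selected per token, plus a per-token check that can localise injected text. It swaps in a simple hard green-list scheme as a stand-in. The names `pick_stream`, `is_green`, `suspect_positions`, the keys, and the toy vocabulary are all hypothetical, and the hash-based router is a placeholder for whatever semantic signal the paper actually uses.

```python
import hashlib
import math
import random

VOCAB = [f"w{i}" for i in range(200)]              # toy vocabulary (assumption)
GAMMA = 0.5                                         # expected green fraction
KEYS = {"A": "key-stream-a", "B": "key-stream-b"}   # hypothetical per-stream keys


def _byte(text: str) -> int:
    """First byte of SHA-256(text), used as a cheap keyed pseudo-random value."""
    return hashlib.sha256(text.encode()).digest()[0]


def pick_stream(prev: str) -> str:
    """Placeholder router; a real system would use a semantic signal here."""
    return "A" if _byte("route|" + prev) % 2 == 0 else "B"


def is_green(prev: str, tok: str) -> bool:
    """True if `tok` is in the green list of the stream selected by `prev`."""
    key = KEYS[pick_stream(prev)]
    return _byte(f"{key}|{prev}|{tok}") < 256 * GAMMA


def generate(n: int, seed: int = 0) -> list[str]:
    """Emit n tokens after a start token, always choosing a green token."""
    rng = random.Random(seed)
    tokens = ["w0"]
    for _ in range(n):
        green = [t for t in VOCAB if is_green(tokens[-1], t)]
        tokens.append(rng.choice(green or VOCAB))
    return tokens


def z_score(tokens: list[str]) -> float:
    """One-proportion z-test on the green-token count over the whole text."""
    n = len(tokens) - 1
    hits = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


def suspect_positions(tokens: list[str]) -> list[int]:
    """Indices of tokens that break the watermark (candidate injected spans)."""
    return [i for i in range(1, len(tokens))
            if not is_green(tokens[i - 1], tokens[i])]


if __name__ == "__main__":
    clean = generate(100)
    rng = random.Random(1)
    injected = [rng.choice(VOCAB) for _ in range(10)]   # simulated spoofed span
    spoofed = clean[:50] + injected + clean[50:]
    print("clean   z:", z_score(clean), "suspects:", suspect_positions(clean))
    print("spoofed z:", z_score(spoofed), "suspects:", suspect_positions(spoofed))
```

In this toy setting the spoofed text still passes the global z-test, which is exactly why piggyback spoofing is a problem for detection-only schemes. The per-token flags can fall only inside the injected span or on the token immediately after it, which gives a rough form of tracing. How DualGuard itself achieves detection and tracing may differ substantially.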
🛡️ Threat Analysis
DualGuard embeds watermark signals in LLM-generated text to support content provenance and attribution. The paper defends against paraphrase attacks (watermark removal) and spoofing attacks (injecting harmful content into watermarked text to falsely implicate the LLM), both of which target output integrity and content authenticity. Because the watermark lives in the text output rather than in the model weights, the work falls under ML09 (Output Integrity Attack) rather than ML05 (Model Theft).