Defense · 2026

MirrorMark: A Distortion-Free Multi-Bit Watermark for Large Language Models

Ya Jiang, Massieh Kordi Boroujeny, Surender Suresh Kumar, Kai Zeng

0 citations · 36 references · arXiv

Published on arXiv · 2601.22246

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

With 54 bits embedded in 300 tokens, MirrorMark improves bit accuracy by 8-12% and correctly identifies up to 11% more watermarked texts at 1% false positive rate compared to prior distortion-free methods.

MirrorMark

Novel technique introduced


As large language models (LLMs) become integral to applications such as question answering and content creation, reliable content attribution has become increasingly important. Watermarking is a promising approach, but existing methods either provide only binary signals or distort the sampling distribution, degrading text quality; distortion-free approaches, in turn, often suffer from weak detectability or robustness. We propose MirrorMark, a multi-bit and distortion-free watermark for LLMs. By mirroring sampling randomness in a measure-preserving manner, MirrorMark embeds multi-bit messages without altering the token probability distribution, preserving text quality by design. To improve robustness, we introduce a context-based scheduler that balances token assignments across message positions while remaining resilient to insertions and deletions. We further provide a theoretical analysis of the equal error rate to interpret empirical performance. Experiments show that MirrorMark matches the text quality of non-watermarked generation while achieving substantially stronger detectability: with 54 bits embedded in 300 tokens, it improves bit accuracy by 8-12% and correctly identifies up to 11% more watermarked texts at 1% false positive rate.
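To make the core idea concrete, here is a minimal sketch of distortion-free embedding via Gumbel-max sampling with keyed randomness, where a message bit is embedded by "mirroring" the pseudorandom uniforms (u → 1 − u), a measure-preserving transform that leaves the token distribution unchanged. All function names and the PRF construction are illustrative assumptions, not the paper's implementation.

```python
import hashlib
import math

def keyed_uniforms(key, context, vocab_size):
    """Derive pseudorandom uniforms in (0, 1), one per vocabulary token,
    from a secret key and the recent context (hypothetical PRF sketch)."""
    us = []
    for i in range(vocab_size):
        h = hashlib.sha256(f"{key}|{context}|{i}".encode()).digest()
        # Map 64 bits to the open interval (0, 1).
        us.append((int.from_bytes(h[:8], "big") + 1) / (2**64 + 2))
    return us

def sample_token(probs, key, context, bit):
    """Gumbel-max sampling with keyed randomness.  Mirroring each uniform
    (u -> 1 - u) when bit == 1 is measure-preserving, so the sampled token
    still follows `probs` exactly -- a distortion-free embedding."""
    us = keyed_uniforms(key, context, len(probs))
    if bit == 1:
        us = [1.0 - u for u in us]          # the "mirror" transform
    gumbels = [-math.log(-math.log(u)) for u in us]
    scores = [math.log(p) + g if p > 0 else float("-inf")
              for p, g in zip(probs, gumbels)]
    return max(range(len(probs)), key=scores.__getitem__)
```

A detector with the same key can recompute both the mirrored and unmirrored randomness for each position and score which variant better explains the observed tokens, recovering the bits without access to the model's probabilities.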


Key Contributions

  • MirrorMark: a distortion-free multi-bit LLM watermark that mirrors sampling randomness in a measure-preserving manner, preserving the token probability distribution by design
  • Context-based scheduler that balances token assignments across message positions, providing robustness to insertions and deletions
  • Theoretical analysis of equal error rate via wrapped Beta distributions for the Gumbel-max construction, linking token probability to detectability
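The context-based scheduler can be sketched as follows: each token is assigned to a message position by hashing its local context with a secret key, so assignments depend only on nearby tokens rather than absolute position. This is a hypothetical illustration of the idea, not the paper's exact construction.

```python
import hashlib

def message_position(context_tokens, num_positions, key="secret"):
    """Hypothetical context-based scheduler: hash the last few tokens
    together with a secret key to choose which message bit this token
    encodes.  Because the assignment depends only on local context, an
    insertion or deletion elsewhere in the text does not shift every
    subsequent assignment, unlike an index-based round-robin schedule.
    A keyed hash also spreads tokens roughly evenly across positions."""
    payload = "|".join(map(str, context_tokens)) + key
    h = hashlib.sha256(payload.encode()).digest()
    return int.from_bytes(h[:8], "big") % num_positions
```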

🛡️ Threat Analysis

Output Integrity Attack

MirrorMark embeds watermarks in LLM-generated TEXT OUTPUTS (not model weights) to enable content attribution and provenance verification — this is output integrity and content watermarking, not model IP protection (ML05). The watermark identifies which model or party produced the content.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Applications
text generation, content attribution, ai-generated text detection