defense 2026

XMark: Reliable Multi-Bit Watermarking for LLM-Generated Texts

Jiahao Xu 1,2, Rui Hu 1, Olivera Kotevska 2, Zikai Zhang 1

Published on arXiv

2604.05242

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Significantly improves decoding accuracy while preserving the quality of watermarked text, outperforming prior multi-bit watermarking methods, especially when the generated text contains few tokens.

XMark

Novel technique introduced


Multi-bit watermarking has emerged as a promising solution for embedding imperceptible binary messages into Large Language Model (LLM)-generated text, enabling reliable attribution and tracing of malicious usage of LLMs. Despite recent progress, existing methods still face key limitations: some become computationally infeasible for large messages, while others suffer from a poor trade-off between text quality and decoding accuracy. Moreover, the decoding accuracy of existing methods drops significantly when the number of tokens in the generated text is limited, a condition that frequently arises in practical usage. To address these challenges, we propose XMark, a novel method for encoding and decoding binary messages in LLM-generated texts. The unique design of XMark's encoder produces a less distorted logit distribution for watermarked token generation, preserving text quality, and also enables its tailored decoder to reliably recover the encoded message with limited tokens. Extensive experiments across diverse downstream tasks show that XMark significantly improves decoding accuracy while preserving the quality of watermarked text, outperforming prior methods. The code is at https://github.com/JiiahaoXU/XMark.


Key Contributions

  • Leave-one-Shard-out (LoSo) encoder using k permutations to preserve text quality while embedding messages
  • Constrained token-shard mapping matrix (cTMM) decoder achieving high accuracy with limited tokens
  • Improved trade-off between text quality and decoding accuracy compared to prior multi-bit watermarking methods
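
The contributions above name the components but do not spell out their mechanics. The toy sketch below is NOT XMark's LoSo encoder or cTMM decoder. It only illustrates one plausible reading of "leave one shard out": the vocabulary is split into shards by a keyed permutation, the message symbol selects a single shard that receives no logit boost, and a decoder guesses the symbol as the least-observed shard. All function names, the bias `delta`, the shard count, and the use of one fixed partition for every position are assumptions made for illustration.

```python
import math
import random


def shard_of(vocab_size, num_shards, seed):
    """Map each token id to a shard using a seeded permutation (hypothetical scheme)."""
    ids = list(range(vocab_size))
    random.Random(seed).shuffle(ids)
    shard = [0] * vocab_size
    for rank, tok in enumerate(ids):
        shard[tok] = rank % num_shards
    return shard


def encode_bias(logits, shard, left_out, delta=2.0):
    """Boost every logit by delta except those in the left-out shard (assumed rule)."""
    return [x if shard[t] == left_out else x + delta for t, x in enumerate(logits)]


def sample_tokens(logits, n, seed):
    """Draw n token ids from the softmax of the logits (weights need no normalising)."""
    weights = [math.exp(x) for x in logits]
    return random.Random(seed).choices(range(len(logits)), weights=weights, k=n)


def decode_symbol(tokens, shard, num_shards):
    """Guess the embedded symbol as the shard that appears least often."""
    counts = [0] * num_shards
    for t in tokens:
        counts[shard[t]] += 1
    return min(range(num_shards), key=counts.__getitem__)


if __name__ == "__main__":
    vocab_size, num_shards, symbol = 1000, 4, 2  # 4 shards carry 2 bits per symbol
    shard = shard_of(vocab_size, num_shards, seed=7)
    biased = encode_bias([0.0] * vocab_size, shard, left_out=symbol)
    tokens = sample_tokens(biased, n=200, seed=0)
    print(decode_symbol(tokens, shard, num_shards))
```

In this toy setting only one of the shards is penalised, so most of the vocabulary stays equally likely, which is the intuition behind a less distorted logit distribution. How XMark actually combines k permutations and constrains its token-shard mapping matrix is described in the paper and the linked repository, not here.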

🛡️ Threat Analysis

Output Integrity Attack

Embeds watermarks in LLM-generated text outputs to trace provenance and enable attribution — this is output integrity/content watermarking, not model theft protection.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Applications
text provenance, llm attribution, malicious content tracing