SAEMark: Steering Personalized Multilingual LLM Watermarks with Sparse Autoencoders
Zhuohao Yu , Xingru Jiang , Weizheng Gu , Yidong Wang , Qingsong Wen , Shikun Zhang , Wei Ye
Published on arXiv (arXiv:2508.08211)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Achieves 99.7% F1 on English text with strong multi-bit detection accuracy across 4 datasets while preserving text quality, using only black-box model access
SAEMark
Novel technique introduced
Watermarking LLM-generated text is critical for content attribution and misinformation prevention. However, existing methods either compromise text quality or require white-box model access and logit manipulation, which excludes API-based models and multilingual scenarios. We propose SAEMark, a general framework for post-hoc multi-bit watermarking that embeds personalized messages solely via inference-time, feature-based rejection sampling, without altering model logits or requiring training. Our approach operates on deterministic features extracted from generated text, selecting outputs whose feature statistics align with key-derived targets. The framework naturally generalizes across languages and domains while preserving text quality, because it selects among sampled LLM outputs rather than modifying them. We provide theoretical guarantees relating watermark success probability to compute budget that hold for any suitable feature extractor. Empirically, we instantiate the framework with Sparse Autoencoders (SAEs), achieving superior detection accuracy and text quality. Experiments across 4 datasets show SAEMark's consistent performance, with 99.7% F1 on English and strong multi-bit detection accuracy. SAEMark establishes a new paradigm for scalable watermarking that works out of the box with closed-source LLMs while enabling content attribution.
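The embedding loop described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: a hash-based scalar stands in for the SAE feature extractor, the helper names (`feature_statistic`, `key_target`, `watermark_by_rejection`) are invented here, and a real system would keep sampling fresh LLM generations until one falls inside the acceptance region.

```python
import hashlib
import statistics

def feature_statistic(text: str) -> float:
    # Stand-in for an SAE feature extractor: any deterministic map
    # from text to a scalar in [0, 1) suffices for this sketch.
    bytes_per_word = [hashlib.sha256(w.encode()).digest()[0] / 256
                      for w in text.split()]
    return statistics.mean(bytes_per_word) if bytes_per_word else 0.0

def key_target(secret_key: str, message_bit: int) -> float:
    # Derive a target value in [0, 1) from the watermark key and
    # the message bit to embed (hypothetical key-derivation scheme).
    h = hashlib.sha256(f"{secret_key}:{message_bit}".encode()).digest()
    return h[0] / 256

def watermark_by_rejection(candidates, secret_key, message_bit, tol=0.15):
    # Feature-based rejection sampling over a fixed candidate pool:
    # keep the generation whose feature statistic lands closest to the
    # key-derived target, and accept it only if it is within `tol`.
    target = key_target(secret_key, message_bit)
    best = min(candidates, key=lambda c: abs(feature_statistic(c) - target))
    return best if abs(feature_statistic(best) - target) <= tol else None
```

Note that only black-box access is exercised: the model's logits are never touched, and the same loop applies to any language or domain for which the feature extractor is defined.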
Key Contributions
- Post-hoc multi-bit watermarking framework using feature-based rejection sampling that requires only black-box LLM access and never modifies model logits or requires training
- Instantiation via Sparse Autoencoders (SAEs) as deterministic feature extractors, enabling multilingual generalization across English, Chinese, and code
- Theoretical worst-case guarantees relating watermark success probability to compute budget for any suitable feature extractor, plus 99.7% F1 on English across 4 datasets
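Detection is likewise black-box: recompute the deterministic feature statistic from the observed text and decode whichever message bit has the nearest key-derived target. A minimal sketch under the same assumptions as above (hash-based statistic in place of a real SAE extractor; helper names are illustrative):

```python
import hashlib
import statistics

def feature_statistic(text: str) -> float:
    # Same hypothetical deterministic text-to-scalar feature as used
    # during embedding; a real system would use SAE activations.
    bytes_per_word = [hashlib.sha256(w.encode()).digest()[0] / 256
                      for w in text.split()]
    return statistics.mean(bytes_per_word) if bytes_per_word else 0.0

def key_target(secret_key: str, message_bit: int) -> float:
    # Key-derived target per message bit (hypothetical scheme).
    h = hashlib.sha256(f"{secret_key}:{message_bit}".encode()).digest()
    return h[0] / 256

def detect_bit(text: str, secret_key: str) -> int:
    # Decode the embedded bit as the one whose target is closest
    # to the text's observed feature statistic.
    s = feature_statistic(text)
    return min((0, 1), key=lambda b: abs(s - key_target(secret_key, b)))
```

Because both the feature extractor and the key derivation are deterministic, detection needs only the text and the secret key, never the generating model.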
🛡️ Threat Analysis
SAEMark embeds watermarks in LLM-generated text outputs (not model weights) for content provenance and attribution, making it a direct output integrity defense. The watermark tracks who produced the content, not who owns the model, placing it squarely under ML09 content watermarking.