Defense · 2025

SAEMark: Steering Personalized Multilingual LLM Watermarks with Sparse Autoencoders

Zhuohao Yu, Xingru Jiang, Weizheng Gu, Yidong Wang, Qingsong Wen, Shikun Zhang, Wei Ye


Published on arXiv: 2508.08211

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves 99.7% F1 on English text with strong multi-bit detection accuracy across 4 datasets while preserving text quality, using only black-box model access

SAEMark

Novel technique introduced


Watermarking LLM-generated text is critical for content attribution and misinformation prevention. However, existing methods either compromise text quality or require white-box model access and logit manipulation. These limitations exclude API-based models and multilingual scenarios. We propose SAEMark, a general framework for post-hoc multi-bit watermarking that embeds personalized messages solely via inference-time, feature-based rejection sampling, without altering model logits or requiring training. Our approach operates on deterministic features extracted from generated text, selecting outputs whose feature statistics align with key-derived targets. This framework naturally generalizes across languages and domains while preserving text quality, since it samples among LLM outputs instead of modifying them. We provide theoretical guarantees relating watermark success probability to compute budget that hold for any suitable feature extractor. Empirically, we demonstrate the framework's effectiveness using Sparse Autoencoders (SAEs), achieving superior detection accuracy and text quality. Experiments across 4 datasets show SAEMark's consistent performance, with 99.7% F1 on English and strong multi-bit detection accuracy. SAEMark establishes a new paradigm for scalable watermarking that works out-of-the-box with closed-source LLMs while enabling content attribution.
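The core loop described in the abstract can be sketched as follows. This is a minimal, illustrative reconstruction, not the paper's implementation: `feature_stat` here is a hash-based stand-in for the paper's SAE feature extractor, and the names `key_target`, `embed_watermark`, and `decode_bit`, along with the budget and tolerance parameters, are assumptions for illustration only.

```python
import hashlib

def feature_stat(text: str) -> float:
    # Hypothetical stand-in for a deterministic feature extractor.
    # The paper derives features from Sparse Autoencoder activations;
    # here we hash the text into a reproducible value in [0, 1).
    h = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(h[:8], "big") / 2**64

def key_target(key: str, bit: int) -> float:
    # Derive a target feature value in [0, 1) from the user's
    # watermark key and one message bit (illustrative derivation).
    h = hashlib.sha256(f"{key}:{bit}".encode("utf-8")).digest()
    return int.from_bytes(h[:8], "big") / 2**64

def embed_watermark(generate, key: str, bit: int,
                    budget: int = 16, tol: float = 0.05) -> str:
    # Feature-based rejection sampling: draw candidate completions
    # from the black-box LLM (`generate` is any zero-argument sampler)
    # and keep the first whose feature statistic lands within `tol`
    # of the key-derived target; otherwise return the closest one.
    # No logits are touched and no training is involved.
    target = key_target(key, bit)
    best, best_dist = None, float("inf")
    for _ in range(budget):
        cand = generate()
        d = abs(feature_stat(cand) - target)
        if d < best_dist:
            best, best_dist = cand, d
        if d <= tol:
            break
    return best

def decode_bit(text: str, key: str) -> int:
    # Detection side: the key holder recomputes the targets and
    # reports whichever message bit the text's feature is closer to.
    f = feature_stat(text)
    d0 = abs(f - key_target(key, 0))
    d1 = abs(f - key_target(key, 1))
    return 0 if d0 <= d1 else 1
```

Because the feature extractor is deterministic and applied post hoc to generated text, the same scheme works unchanged for any language or domain the extractor covers, and the rejection budget directly trades compute for watermark strength, which is the shape of the paper's theoretical guarantee.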


Key Contributions

  • Post-hoc multi-bit watermarking framework using feature-based rejection sampling that requires only black-box LLM access and never modifies model logits or requires training
  • Instantiation via Sparse Autoencoders (SAEs) as deterministic feature extractors, enabling multilingual generalization across English, Chinese, and code
  • Theoretical worst-case guarantees relating watermark detection accuracy to computational budget, plus 99.7% F1 on English across 4 datasets

🛡️ Threat Analysis

Output Integrity Attack

SAEMark embeds watermarks in LLM-generated text outputs (not model weights) for content provenance and attribution — a direct output integrity defense. The watermark tracks who produced the content, not who owns the model, placing it firmly in ML09 content watermarking.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
4 unnamed datasets (English, Chinese, code domains)
Applications
llm text watermarking, content attribution, misinformation prevention