Defense · 2025

SAEMark: Steering Personalized Multilingual LLM Watermarks with Sparse Autoencoders

Zhuohao Yu, Xingru Jiang, Weizheng Gu, Yidong Wang, Qingsong Wen, Shikun Zhang, Wei Ye


Published on arXiv: 2508.08211

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves 99.7% F1 on English text with strong multi-bit detection accuracy across 4 datasets while preserving text quality, using only black-box model access

SAEMark

Novel technique introduced


Watermarking LLM-generated text is critical for content attribution and misinformation prevention. However, existing methods either compromise text quality or require white-box model access and logit manipulation. These limitations exclude API-based models and multilingual scenarios. We propose SAEMark, a general framework for post-hoc multi-bit watermarking that embeds personalized messages solely via inference-time, feature-based rejection sampling, without altering model logits or requiring training. Our approach operates on deterministic features extracted from generated text, selecting outputs whose feature statistics align with key-derived targets. This framework naturally generalizes across languages and domains while preserving text quality, since it samples among LLM outputs instead of modifying them. We provide theoretical guarantees relating watermark success probability to compute budget that hold for any suitable feature extractor. Empirically, we demonstrate the framework's effectiveness using Sparse Autoencoders (SAEs), achieving superior detection accuracy and text quality. Experiments across 4 datasets show SAEMark's consistent performance, with 99.7% F1 on English and strong multi-bit detection accuracy. SAEMark establishes a new paradigm for scalable watermarking that works out-of-the-box with closed-source LLMs while enabling content attribution.
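The core loop described in the abstract can be sketched as follows. This is a minimal, illustrative reconstruction, not the paper's implementation: `feature_stat` here is a hash-based stand-in for the paper's SAE feature extractor, and the names `key_target`, `embed_watermark`, and `decode_bit`, along with the budget and tolerance parameters, are assumptions for illustration only.

```python
import hashlib

def feature_stat(text: str) -> float:
    # Hypothetical stand-in for a deterministic feature extractor.
    # The paper derives features from Sparse Autoencoder activations;
    # here we hash the text into a reproducible value in [0, 1).
    h = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(h[:8], "big") / 2**64

def key_target(key: str, bit: int) -> float:
    # Derive a target feature value in [0, 1) from the user's
    # watermark key and one message bit (illustrative derivation).
    h = hashlib.sha256(f"{key}:{bit}".encode("utf-8")).digest()
    return int.from_bytes(h[:8], "big") / 2**64

def embed_watermark(generate, key: str, bit: int,
                    budget: int = 16, tol: float = 0.05) -> str:
    # Feature-based rejection sampling: draw candidate completions
    # from the black-box LLM (`generate` is any zero-argument sampler)
    # and keep the first whose feature statistic lands within `tol`
    # of the key-derived target; otherwise return the closest one.
    # No logits are touched and no training is involved.
    target = key_target(key, bit)
    best, best_dist = None, float("inf")
    for _ in range(budget):
        cand = generate()
        d = abs(feature_stat(cand) - target)
        if d < best_dist:
            best, best_dist = cand, d
        if d <= tol:
            break
    return best

def decode_bit(text: str, key: str) -> int:
    # Detection side: the key holder recomputes the targets and
    # reports whichever message bit the text's feature is closer to.
    f = feature_stat(text)
    d0 = abs(f - key_target(key, 0))
    d1 = abs(f - key_target(key, 1))
    return 0 if d0 <= d1 else 1
```

Because the feature extractor is deterministic and applied post hoc to generated text, the same scheme works unchanged for any language or domain the extractor covers, and the rejection budget directly trades compute for watermark strength, which is the shape of the paper's theoretical guarantee.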


Key Contributions

  • Post-hoc multi-bit watermarking framework using feature-based rejection sampling that requires only black-box LLM access and never modifies model logits or requires training
  • Instantiation via Sparse Autoencoders (SAEs) as deterministic feature extractors, enabling multilingual generalization across English, Chinese, and code
  • Theoretical worst-case guarantees relating watermark detection accuracy to computational budget, plus 99.7% F1 on English across 4 datasets

🛡️ Threat Analysis

Output Integrity Attack

SAEMark embeds watermarks in LLM-generated text outputs (not model weights) for content provenance and attribution — a direct output integrity defense. The watermark tracks who produced the content, not who owns the model, placing it firmly in ML09 content watermarking.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
4 unnamed datasets (English, Chinese, code domains)
Applications
llm text watermarking, content attribution, misinformation prevention