Defense · 2025

Is Multilingual LLM Watermarking Truly Multilingual? A Simple Back-Translation Solution

Asim Mohamed 1,2, Martin Gubri 2

0 citations · 24 references · arXiv


Published on arXiv

2510.18019

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

STEAM achieves average gains of +0.19 AUC and +40 percentage points TPR@1% FPR across 17 languages against translation attacks compared to existing multilingual watermarking methods.

STEAM

Novel technique introduced


Multilingual watermarking aims to make large language model (LLM) outputs traceable across languages, yet current methods still fall short. Despite claims of cross-lingual robustness, they are evaluated only on high-resource languages. We show that existing multilingual watermarking methods are not truly multilingual: they fail to remain robust under translation attacks in medium- and low-resource languages. We trace this failure to semantic clustering, which fails when the tokenizer vocabulary contains too few full-word tokens for a given language. To address this, we introduce STEAM, a back-translation-based detection method that restores watermark strength lost through translation. STEAM is compatible with any watermarking method, robust across different tokenizers and languages, non-invasive, and easily extendable to new languages. With average gains of +0.19 AUC and +40%p TPR@1% on 17 languages, STEAM provides a simple and robust path toward fairer watermarking across diverse languages.
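The core idea of STEAM, as described above, is to back-translate a suspect text and run the base watermark detector on each candidate, keeping the strongest signal. The following is a minimal sketch of that design, not the authors' implementation: `detect_watermark` stands in for any base detector (shown here as a toy green-list token ratio) and `back_translate` stands in for a real MT system (identity here); the function names and pivot-language list are assumptions for illustration.

```python
def detect_watermark(text: str) -> float:
    # Stand-in for any base watermark detector (e.g. a green-list z-score).
    # Toy heuristic: fraction of tokens falling in a fixed "green" set.
    green_list = {"the", "a", "of"}
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in green_list for t in tokens) / len(tokens)

def back_translate(text: str, lang: str) -> str:
    # Stand-in for a machine-translation system mapping `text` back into
    # language `lang`; a real implementation would call an NMT model.
    return text

def steam_score(text: str, pivot_langs=("en", "fr", "de")) -> float:
    """Max detector score over the original text and its back-translations."""
    candidates = [text] + [back_translate(text, lang) for lang in pivot_langs]
    return max(detect_watermark(c) for c in candidates)
```

Because detection only wraps the existing detector with extra candidates, this scheme is non-invasive (the generation-side watermarking pipeline is untouched) and extends to a new language by adding it to the pivot list.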


Key Contributions

  • Empirically demonstrates that existing multilingual LLM watermarking methods fail for medium- and low-resource languages under translation attacks, tracing the failure to sparse tokenizer vocabularies
  • Proposes STEAM, a back-translation-based watermark detection method that restores watermark signal after translation and is compatible with any underlying watermarking scheme
  • Achieves average gains of +0.19 AUC and +40 percentage points TPR@1% FPR across 17 languages without modifying the base watermarking pipeline

🛡️ Threat Analysis

Output Integrity Attack

The paper is squarely about watermarking LLM-generated text outputs to trace provenance, and defends against translation attacks that erase watermark signals — a direct output integrity and content provenance concern.


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
black_box · inference_time
Datasets
17-language multilingual evaluation benchmark
Applications
llm text watermarking · ai-generated text detection · multilingual content provenance