Defense · 2025

Is Multilingual LLM Watermarking Truly Multilingual? A Simple Back-Translation Solution

Asim Mohamed 1,2, Martin Gubri 2

0 citations · 24 references · arXiv


Published on arXiv

2510.18019

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

STEAM achieves average gains of +0.19 AUC and +40 percentage points TPR@1% FPR across 17 languages against translation attacks compared to existing multilingual watermarking methods.

STEAM

Novel technique introduced


Multilingual watermarking aims to make large language model (LLM) outputs traceable across languages, yet current methods still fall short. Despite claims of cross-lingual robustness, they are evaluated only on high-resource languages. We show that existing multilingual watermarking methods are not truly multilingual: they fail to remain robust under translation attacks in medium- and low-resource languages. We trace this failure to semantic clustering, which fails when the tokenizer vocabulary contains too few full-word tokens for a given language. To address this, we introduce STEAM, a back-translation-based detection method that restores watermark strength lost through translation. STEAM is compatible with any watermarking method, robust across different tokenizers and languages, non-invasive, and easily extendable to new languages. With average gains of +0.19 AUC and +40%p TPR@1% on 17 languages, STEAM provides a simple and robust path toward fairer watermarking across diverse languages.
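The core idea of STEAM, as described above, is to back-translate a suspect text and run the base watermark detector on each candidate, keeping the strongest signal. The following is a minimal sketch of that design, not the authors' implementation: `detect_watermark` stands in for any base detector (shown here as a toy green-list token ratio) and `back_translate` stands in for a real MT system (identity here); the function names and pivot-language list are assumptions for illustration.

```python
def detect_watermark(text: str) -> float:
    # Stand-in for any base watermark detector (e.g. a green-list z-score).
    # Toy heuristic: fraction of tokens falling in a fixed "green" set.
    green_list = {"the", "a", "of"}
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in green_list for t in tokens) / len(tokens)

def back_translate(text: str, lang: str) -> str:
    # Stand-in for a machine-translation system mapping `text` back into
    # language `lang`; a real implementation would call an NMT model.
    return text

def steam_score(text: str, pivot_langs=("en", "fr", "de")) -> float:
    """Max detector score over the original text and its back-translations."""
    candidates = [text] + [back_translate(text, lang) for lang in pivot_langs]
    return max(detect_watermark(c) for c in candidates)
```

Because detection only wraps the existing detector with extra candidates, this scheme is non-invasive (the generation-side watermarking pipeline is untouched) and extends to a new language by adding it to the pivot list.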


Key Contributions

  • Empirically demonstrates that existing multilingual LLM watermarking methods fail for medium- and low-resource languages under translation attacks, tracing the failure to sparse tokenizer vocabularies
  • Proposes STEAM, a back-translation-based watermark detection method that restores watermark signal after translation and is compatible with any underlying watermarking scheme
  • Achieves average gains of +0.19 AUC and +40 percentage points TPR@1% FPR across 17 languages without modifying the base watermarking pipeline

🛡️ Threat Analysis

Output Integrity Attack

The paper is squarely about watermarking LLM-generated text outputs to trace provenance, and defends against translation attacks that erase watermark signals — a direct output integrity and content provenance concern.


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
black_box · inference_time
Datasets
17-language multilingual evaluation benchmark
Applications
llm text watermarking · ai-generated text detection · multilingual content provenance