Cross-Lingual Summarization as a Black-Box Watermark Removal Attack
Published on arXiv (2510.24789)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
CLSA reduces XSIR watermark detection AUROC from 0.827 (paraphrasing) and 0.823 (CWRA) down to 0.53 (near chance) while preserving semantic utility across five diverse languages
CLSA (Cross-Lingual Summarization Attack)
Novel technique introduced
Watermarking has been proposed as a lightweight mechanism to identify AI-generated text, with schemes typically relying on perturbations to token distributions. While prior work shows that paraphrasing can weaken such signals, these attacks remain partially detectable or degrade text quality. We demonstrate that cross-lingual summarization attacks (CLSA) -- translation to a pivot language followed by summarization and optional back-translation -- constitute a qualitatively stronger attack vector. By forcing a semantic bottleneck across languages, CLSA systematically destroys token-level statistical biases while preserving semantic fidelity. In experiments across multiple watermarking schemes (KGW, SIR, XSIR, Unigram) and five languages (Amharic, Chinese, Hindi, Spanish, Swahili), we show that CLSA reduces watermark detection accuracy more effectively than monolingual paraphrasing at similar quality levels. Our results highlight an underexplored vulnerability that challenges the practicality of watermarking for provenance or regulation. We argue that robust provenance solutions must move beyond distributional watermarking and incorporate cryptographic or model-attestation approaches. On 300 held-out samples per language, CLSA consistently drives detection toward chance while preserving task utility. Concretely, for XSIR (a scheme explicitly designed for cross-lingual robustness), AUROC is 0.827 under paraphrasing and 0.823 under Cross-Lingual Watermark Removal Attacks (CWRA) [He et al., 2024] with Chinese as the pivot, whereas CLSA drives it down to 0.53 (near chance). These results highlight a practical, low-cost removal pathway that crosses languages and compresses content without visible artifacts.
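The attack described above is a three-stage black-box pipeline. A minimal sketch of the orchestration, where the `translate` and `summarize` callables are hypothetical stand-ins (the paper's experiments use M2M100 for translation and an XLSum-finetuned mT5 for summarization, but any commodity black-box services would fit this shape):

```python
from typing import Callable

def clsa_attack(
    text: str,
    translate: Callable[[str, str, str], str],  # (text, src_lang, tgt_lang) -> text
    summarize: Callable[[str], str],            # pivot-language summarizer
    src_lang: str = "en",
    pivot_lang: str = "zh",
    back_translate: bool = True,
) -> str:
    """Cross-Lingual Summarization Attack: force a semantic bottleneck
    across languages so token-level watermark biases do not survive."""
    pivoted = translate(text, src_lang, pivot_lang)  # stage 1: translate to pivot
    summary = summarize(pivoted)                     # stage 2: compress in the pivot language
    if back_translate:                               # stage 3 (optional): return to source language
        return translate(summary, pivot_lang, src_lang)
    return summary

# Toy stand-ins for demonstration only; they tag the text instead of
# actually translating or summarizing it.
toy_translate = lambda t, s, d: f"[{s}->{d}] {t}"
toy_summarize = lambda t: t.split(".")[0]  # crude "summary": keep the first sentence
attacked = clsa_attack("Watermarked output. Extra detail.", toy_translate, toy_summarize)
```

Because every stage is an off-the-shelf model invoked as a black box, the attacker needs no access to watermark keys, generation logits, or detector internals.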
Key Contributions
- Defines CLSA — a black-box watermark removal pipeline using commodity translation (M2M100) and summarization (mT5/XLSum) models without requiring access to watermark keys or detector internals
- Multi-detector (KGW, SIR, XSIR, Unigram) and multi-language (Amharic, Chinese, Hindi, Spanish, Swahili) evaluation showing CLSA consistently outperforms monolingual paraphrase and prior CWRA attacks at similar quality levels
- Mechanistic analysis attributing effectiveness to the joint effect of cross-lingual tokenization disruption and summarization-induced length compression collapsing seeded-token biases
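For intuition on the "seeded-token biases" being collapsed, here is a minimal KGW-style detector sketch (my simplification of the green-list scheme, not the paper's code): the previous token pseudorandomly seeds a "green list" covering a fraction GAMMA of the vocabulary, and detection z-scores the observed green-token count against that null rate. Translation and summarization replace the token sequence wholesale, so the green count falls back toward chance and the z-score toward zero.

```python
import hashlib
import math

GAMMA = 0.5  # fraction of the vocabulary marked "green" at each step

def is_green(prev_token: str, token: str) -> bool:
    """Pseudorandomly assign `token` to the green list seeded by `prev_token`."""
    h = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return h[0] / 255.0 < GAMMA

def detection_z_score(tokens: list) -> float:
    """z-score of the green-token count against the null binomial(n, GAMMA).

    Watermarked generation biases sampling toward green tokens, so intact
    watermarked text scores high; attacked text regresses toward ~0.
    """
    n = len(tokens) - 1
    greens = sum(is_green(tokens[i - 1], tokens[i]) for i in range(1, len(tokens)))
    return (greens - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))
```

This also makes the mechanistic claim concrete: cross-lingual retokenization changes every (prev_token, token) pair, and compression shortens n, both of which shrink the detectable bias.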
🛡️ Threat Analysis
Proposes a black-box watermark removal attack targeting token-distribution watermarks embedded in LLM text outputs (KGW, SIR, XSIR, Unigram) — directly attacking content provenance and output integrity. The watermarks reside in generated text, not model weights, making this ML09, not ML05.
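The AUROC figures quoted above (0.827 and 0.823 versus 0.53) measure how well detector scores separate attacked watermarked text from clean text, with 0.5 meaning the detector is at chance. A minimal rank-based sketch of that metric (the pairwise form, equivalent to a normalized Mann-Whitney U statistic):

```python
def auroc(pos_scores: list, neg_scores: list) -> float:
    """Probability that a random positive (watermarked) sample scores above
    a random negative (clean) sample; ties count half. 0.5 = chance."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

Feeding detector z-scores for attacked-watermarked and clean samples into a function like this is all the evaluation requires, which is why the attack can be assessed fully black-box.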