
Cross-Lingual Summarization as a Black-Box Watermark Removal Attack

Gokul Ganesan



Published on arXiv (2510.24789)

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

CLSA drives XSIR watermark detection AUROC down to 0.53 (near chance), versus 0.827 under paraphrasing and 0.823 under CWRA, while preserving semantic utility across five diverse languages

CLSA (Cross-Lingual Summarization Attack)

Novel technique introduced


Watermarking has been proposed as a lightweight mechanism to identify AI-generated text, with schemes typically relying on perturbations to token distributions. While prior work shows that paraphrasing can weaken such signals, these attacks remain partially detectable or degrade text quality. We demonstrate that cross-lingual summarization attacks (CLSA) -- translation to a pivot language followed by summarization and optional back-translation -- constitute a qualitatively stronger attack vector. By forcing a semantic bottleneck across languages, CLSA systematically destroys token-level statistical biases while preserving semantic fidelity. In experiments across multiple watermarking schemes (KGW, SIR, XSIR, Unigram) and five languages (Amharic, Chinese, Hindi, Spanish, Swahili), we show that CLSA reduces watermark detection accuracy more effectively than monolingual paraphrase at similar quality levels. Our results highlight an underexplored vulnerability that challenges the practicality of watermarking for provenance or regulation. We argue that robust provenance solutions must move beyond distributional watermarking and incorporate cryptographic or model-attestation approaches. On 300 held-out samples per language, CLSA consistently drives detection toward chance while preserving task utility. Concretely, for XSIR (explicitly designed for cross-lingual robustness), AUROC with paraphrasing is $0.827$, with Cross-Lingual Watermark Removal Attacks (CWRA) [He et al., 2024] using Chinese as the pivot, it is $0.823$, whereas CLSA drives it down to $0.53$ (near chance). Results highlight a practical, low-cost removal pathway that crosses languages and compresses content without visible artifacts.
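The AUROC figures above compare detector scores on attacked (or intact) watermarked text against scores on clean text. The metric itself is simple to state; a minimal sketch in pure Python (the score lists in the usage example are hypothetical placeholders, not the paper's data):

```python
def auroc(watermarked_scores, clean_scores):
    """Probability that a randomly chosen watermarked sample scores higher
    than a randomly chosen clean sample; ties count as 0.5."""
    pairs = [(w, c) for w in watermarked_scores for c in clean_scores]
    wins = sum(1.0 if w > c else 0.5 if w == c else 0.0 for w, c in pairs)
    return wins / len(pairs)

# An intact watermark separates the distributions (AUROC near 1.0):
auroc([4.1, 3.8, 5.0], [0.2, -0.1, 0.5])
# A strong removal attack pushes post-attack scores into the clean range,
# collapsing AUROC toward 0.5 -- the "near chance" 0.53 reported for XSIR.
```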


Key Contributions

  • Defines CLSA — a black-box watermark removal pipeline using commodity translation (M2M100) and summarization (mT5/XLSum) models without requiring access to watermark keys or detector internals
  • Multi-detector (KGW, SIR, XSIR, Unigram) and multi-language (Amharic, Chinese, Hindi, Spanish, Swahili) evaluation showing CLSA consistently outperforms monolingual paraphrase and prior CWRA attacks at similar quality levels
  • Mechanistic analysis attributing effectiveness to the joint effect of cross-lingual tokenization disruption and summarization-induced length compression collapsing seeded-token biases
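The pipeline in the first bullet is a three-stage composition: translate to a pivot language, summarize there, optionally translate back. A sketch with the model calls stubbed out — the `translate` and `summarize` callables are hypothetical stand-ins for M2M100 and mT5/XLSum inference, and the default language codes are illustrative:

```python
from typing import Callable

def clsa(text: str,
         translate: Callable[[str, str, str], str],
         summarize: Callable[[str], str],
         src: str = "en",
         pivot: str = "zh",
         back_translate: bool = True) -> str:
    """Cross-Lingual Summarization Attack: the summarization step in the
    pivot language is the semantic bottleneck that destroys token-level
    watermark statistics while keeping the meaning."""
    pivoted = translate(text, src, pivot)    # commodity MT, e.g. M2M100
    compressed = summarize(pivoted)          # e.g. mT5 fine-tuned on XLSum
    if back_translate:
        return translate(compressed, pivot, src)
    return compressed
```

Note the black-box property: the function never touches watermark keys, detector internals, or the generating model — only off-the-shelf translation and summarization.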

🛡️ Threat Analysis

Output Integrity Attack

Proposes a black-box watermark removal attack targeting token-distribution watermarks embedded in LLM text outputs (KGW, SIR, XSIR, Unigram) — directly attacking content provenance and output integrity. The watermarks reside in generated text, not model weights, making this ML09 (Output Integrity Attack), not ML05 (Model Theft).
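For context on what these token-distribution schemes look like, here is a toy KGW-style green-list detector. It is an illustrative sketch only: the hash context is simplified to one token, the vocabulary is a toy list, and `gamma` follows the usual KGW convention, none of it taken from the paper's implementation:

```python
import hashlib
import math
import random

def green_list(prev_token: str, vocab: list, gamma: float = 0.5) -> set:
    """Pseudorandomly partition the vocabulary, seeded by the previous
    token; the watermarker biases generation toward this 'green' subset."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16) % (2 ** 32)
    rng = random.Random(seed)
    return set(rng.sample(vocab, int(gamma * len(vocab))))

def kgw_z_score(tokens: list, vocab: list, gamma: float = 0.5) -> float:
    """Detection: count green-list hits against the binomial null.
    Unwatermarked text gives z near 0; watermarked text gives large z."""
    hits = sum(1 for prev, tok in zip(tokens, tokens[1:])
               if tok in green_list(prev, vocab, gamma))
    t = len(tokens) - 1
    return (hits - gamma * t) / math.sqrt(gamma * (1 - gamma) * t)

def watermarked_sample(start: str, vocab: list, length: int) -> list:
    """Toy 'generator' that always emits a green token (a real LM only
    softly biases sampling, but the detection statistic is the same)."""
    tokens = [start]
    for _ in range(length):
        tokens.append(sorted(green_list(tokens[-1], vocab))[0])
    return tokens
```

The mechanistic claim above maps directly onto this sketch: translation re-tokenizes the text in another language and summarization discards most tokens, so the green-hit rate collapses back to `gamma` and the z-score returns to chance.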


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
XLSum (via mT5); 300 held-out samples per language (5 languages)
Applications
ai-generated text detection, text watermarking, llm content provenance