
Cross-Lingual Summarization as a Black-Box Watermark Removal Attack

Gokul Ganesan



Published on arXiv (2510.24789)

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

CLSA drives XSIR watermark detection AUROC down to 0.53 (near chance), versus 0.827 under paraphrasing and 0.823 under CWRA, while preserving semantic utility across five diverse languages

CLSA (Cross-Lingual Summarization Attack)

Novel technique introduced


Watermarking has been proposed as a lightweight mechanism to identify AI-generated text, with schemes typically relying on perturbations to token distributions. While prior work shows that paraphrasing can weaken such signals, these attacks remain partially detectable or degrade text quality. We demonstrate that cross-lingual summarization attacks (CLSA) -- translation to a pivot language followed by summarization and optional back-translation -- constitute a qualitatively stronger attack vector. By forcing a semantic bottleneck across languages, CLSA systematically destroys token-level statistical biases while preserving semantic fidelity. In experiments across multiple watermarking schemes (KGW, SIR, XSIR, Unigram) and five languages (Amharic, Chinese, Hindi, Spanish, Swahili), we show that CLSA reduces watermark detection accuracy more effectively than monolingual paraphrase at similar quality levels. Our results highlight an underexplored vulnerability that challenges the practicality of watermarking for provenance or regulation. We argue that robust provenance solutions must move beyond distributional watermarking and incorporate cryptographic or model-attestation approaches. On 300 held-out samples per language, CLSA consistently drives detection toward chance while preserving task utility. Concretely, for XSIR (explicitly designed for cross-lingual robustness), AUROC with paraphrasing is $0.827$, with Cross-Lingual Watermark Removal Attacks (CWRA) [He et al., 2024] using Chinese as the pivot, it is $0.823$, whereas CLSA drives it down to $0.53$ (near chance). Results highlight a practical, low-cost removal pathway that crosses languages and compresses content without visible artifacts.
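The AUROC figures above compare detector scores on attacked (or intact) watermarked text against scores on clean text. The metric itself is simple to state; a minimal sketch in pure Python (the score lists in the usage example are hypothetical placeholders, not the paper's data):

```python
def auroc(watermarked_scores, clean_scores):
    """Probability that a randomly chosen watermarked sample scores higher
    than a randomly chosen clean sample; ties count as 0.5."""
    pairs = [(w, c) for w in watermarked_scores for c in clean_scores]
    wins = sum(1.0 if w > c else 0.5 if w == c else 0.0 for w, c in pairs)
    return wins / len(pairs)

# An intact watermark separates the distributions (AUROC near 1.0):
auroc([4.1, 3.8, 5.0], [0.2, -0.1, 0.5])
# A strong removal attack pushes post-attack scores into the clean range,
# collapsing AUROC toward 0.5 -- the "near chance" 0.53 reported for XSIR.
```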


Key Contributions

  • Defines CLSA — a black-box watermark removal pipeline using commodity translation (M2M100) and summarization (mT5/XLSum) models without requiring access to watermark keys or detector internals
  • Multi-detector (KGW, SIR, XSIR, Unigram) and multi-language (Amharic, Chinese, Hindi, Spanish, Swahili) evaluation showing CLSA consistently outperforms monolingual paraphrase and prior CWRA attacks at similar quality levels
  • Mechanistic analysis attributing effectiveness to the joint effect of cross-lingual tokenization disruption and summarization-induced length compression collapsing seeded-token biases
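The pipeline in the first bullet is a three-stage composition: translate to a pivot language, summarize there, optionally translate back. A sketch with the model calls stubbed out — the `translate` and `summarize` callables are hypothetical stand-ins for M2M100 and mT5/XLSum inference, and the default language codes are illustrative:

```python
from typing import Callable

def clsa(text: str,
         translate: Callable[[str, str, str], str],
         summarize: Callable[[str], str],
         src: str = "en",
         pivot: str = "zh",
         back_translate: bool = True) -> str:
    """Cross-Lingual Summarization Attack: the summarization step in the
    pivot language is the semantic bottleneck that destroys token-level
    watermark statistics while keeping the meaning."""
    pivoted = translate(text, src, pivot)    # commodity MT, e.g. M2M100
    compressed = summarize(pivoted)          # e.g. mT5 fine-tuned on XLSum
    if back_translate:
        return translate(compressed, pivot, src)
    return compressed
```

Note the black-box property: the function never touches watermark keys, detector internals, or the generating model — only off-the-shelf translation and summarization.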

🛡️ Threat Analysis

Output Integrity Attack

Proposes a black-box watermark removal attack targeting token-distribution watermarks embedded in LLM text outputs (KGW, SIR, XSIR, Unigram) — directly attacking content provenance and output integrity. The watermarks reside in generated text, not model weights, making this ML09 (Output Integrity Attack), not ML05 (Model Theft).
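For context on what these token-distribution schemes look like, here is a toy KGW-style green-list detector. It is an illustrative sketch only: the hash context is simplified to one token, the vocabulary is a toy list, and `gamma` follows the usual KGW convention, none of it taken from the paper's implementation:

```python
import hashlib
import math
import random

def green_list(prev_token: str, vocab: list, gamma: float = 0.5) -> set:
    """Pseudorandomly partition the vocabulary, seeded by the previous
    token; the watermarker biases generation toward this 'green' subset."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16) % (2 ** 32)
    rng = random.Random(seed)
    return set(rng.sample(vocab, int(gamma * len(vocab))))

def kgw_z_score(tokens: list, vocab: list, gamma: float = 0.5) -> float:
    """Detection: count green-list hits against the binomial null.
    Unwatermarked text gives z near 0; watermarked text gives large z."""
    hits = sum(1 for prev, tok in zip(tokens, tokens[1:])
               if tok in green_list(prev, vocab, gamma))
    t = len(tokens) - 1
    return (hits - gamma * t) / math.sqrt(gamma * (1 - gamma) * t)

def watermarked_sample(start: str, vocab: list, length: int) -> list:
    """Toy 'generator' that always emits a green token (a real LM only
    softly biases sampling, but the detection statistic is the same)."""
    tokens = [start]
    for _ in range(length):
        tokens.append(sorted(green_list(tokens[-1], vocab))[0])
    return tokens
```

The mechanistic claim above maps directly onto this sketch: translation re-tokenizes the text in another language and summarization discards most tokens, so the green-hit rate collapses back to `gamma` and the z-score returns to chance.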


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
XLSum (via mT5); 300 held-out samples per language (5 languages)
Applications
ai-generated text detection, text watermarking, llm content provenance