Attack · 2025

LLM Watermark Evasion via Bias Inversion

Jeongyeon Hwang, Sangdon Park, Jungseul Ok

0 citations · 42 references · arXiv


Published on arXiv · 2509.23019

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

BIRA achieves >99% watermark evasion rates across diverse LLM watermarking schemes while preserving semantic fidelity substantially better than prior query-free baselines.

BIRA (Bias-Inversion Rewriting Attack)

Novel technique introduced


Watermarking offers a promising solution for detecting LLM-generated content, yet its robustness under realistic query-free (black-box) evasion remains an open challenge. Existing query-free attacks often achieve limited success or severely distort semantic meaning. We bridge this gap by theoretically analyzing rewriting-based evasion, demonstrating that reducing the average conditional probability of sampling green tokens by a small margin causes the detection probability to decay exponentially. Guided by this insight, we propose the Bias-Inversion Rewriting Attack (BIRA), a practical query-free method that applies a negative logit bias to a proxy suppression set identified via token surprisal. Empirically, BIRA achieves state-of-the-art evasion rates (>99%) across diverse watermarking schemes while preserving semantic fidelity substantially better than prior baselines. Our findings reveal a fundamental vulnerability in current watermarking methods and highlight the need for rigorous stress tests.
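The exponential-decay claim can be made concrete with a standard concentration argument (a sketch under the usual green-list detection model, not the paper's exact derivation). A detector flags text when the fraction of green tokens among $T$ tokens exceeds a threshold $\tau$. If rewriting lowers the average conditional probability of sampling a green token to $\tau - \delta$ for some margin $\delta > 0$, an Azuma–Hoeffding bound on the green-token count gives

$$\Pr\!\left[\frac{1}{T}\sum_{t=1}^{T}\mathbb{1}[x_t \in G] \ge \tau\right] \le \exp\!\left(-2\delta^2 T\right),$$

so even a small per-token margin $\delta$ drives the detection probability toward zero exponentially in the text length $T$.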


Key Contributions

  • Theoretical proof that reducing the average conditional probability of sampling 'green' tokens by even a small margin causes watermark detection probability to decay exponentially
  • BIRA: a practical query-free attack that applies negative logit bias to a proxy suppression set identified via token surprisal, requiring no access to the original watermarked model
  • State-of-the-art evasion rates (>99%) across diverse watermarking schemes, while preserving semantic fidelity substantially better than prior baselines
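The core mechanism can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the proxy suppression set is built from tokens of the watermarked text that have low surprisal under a proxy model (such tokens are disproportionately likely to be watermark-boosted "green" tokens), and the `threshold` and `bias` values are illustrative assumptions.

```python
import numpy as np

def surprisal(logits, token_id):
    # Surprisal = -log p(token) under the proxy model's next-token distribution.
    logp = logits - np.logaddexp.reduce(logits)
    return -logp[token_id]

def build_suppression_set(proxy_logits, token_ids, threshold=1.0):
    """Collect tokens from the watermarked text with low proxy surprisal.

    proxy_logits: one logit vector per position of the watermarked text,
    computed by a proxy LM (no access to the watermarked model needed).
    Low-surprisal tokens form the proxy suppression set; the selection
    rule shown here is a simplifying assumption.
    """
    return {t for logits, t in zip(proxy_logits, token_ids)
            if surprisal(logits, t) < threshold}

def biased_sampling_logits(logits, suppression_set, bias=2.0):
    # Apply a negative logit bias to suppression-set tokens before the
    # rewriting model samples its next token, steering it away from
    # likely-green tokens.
    out = logits.copy()
    for t in suppression_set:
        out[t] -= bias
    return out
```

In practice the rewriting model would generate a paraphrase token by token, passing each step's logits through `biased_sampling_logits` before sampling.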

🛡️ Threat Analysis

Output Integrity Attack

LLM text watermarks are output integrity / content provenance mechanisms (detecting AI-generated text). BIRA is a watermark evasion/removal attack — it defeats the detection signal embedded in model outputs, making AI-generated text undetectable. Attacking content watermarks is canonically ML09.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Applications
llm-generated text detection, text watermarking