defense 2026

Rethinking LLM Watermark Detection in Black-Box Settings: A Non-Intrusive Third-Party Framework

Zhuoshang Wang 1,2, Yubing Ren 1,2, Yanan Cao 1,2, Fang Fang 1,2, Xiaoxue Li 3, Li Guo 1,2

0 citations

Published on arXiv

2603.14968

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves superior detection performance compared to existing methods while enabling independent auditing without compromising model security

TTP-Detect

Novel technique introduced


While watermarking serves as a critical mechanism for LLM provenance, existing secret-key schemes tightly couple detection with injection, requiring access to keys or provider-side scheme-specific detectors for verification. This dependency creates a fundamental barrier for real-world governance, as independent auditing becomes impossible without compromising model security or relying on the opaque claims of service providers. To resolve this dilemma, we introduce TTP-Detect, a pioneering black-box framework designed for non-intrusive, third-party watermark verification. By decoupling detection from injection, TTP-Detect reframes verification as a relative hypothesis testing problem. It employs a proxy model to amplify watermark-relevant signals and a suite of complementary relative measurements to assess the alignment of the query text with watermarked distributions. Extensive experiments across representative watermarking schemes, datasets and models demonstrate that TTP-Detect achieves superior detection performance and robustness against diverse attacks.
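To make the idea of relative hypothesis testing concrete, here is a minimal sketch. It is not the TTP-Detect procedure, whose proxy model and measurements are not specified in this summary. It assumes a scalar score per text (for example, one produced by some proxy model) and two reference sets of scores, one from text known to be watermarked and one from text known not to be. The function names, the z-distance statistic, and the threshold are all hypothetical choices made for illustration.

```python
from statistics import mean, stdev


def relative_score(query_score, wm_ref, clean_ref):
    """Return a positive value when the query score sits closer to the
    watermarked reference set than to the clean one.

    Distances are standardized by each reference set's own spread, so the
    decision depends on relative position rather than on any secret key.
    This statistic is an illustrative assumption, not the paper's method.
    """
    z_wm = abs(query_score - mean(wm_ref)) / stdev(wm_ref)
    z_clean = abs(query_score - mean(clean_ref)) / stdev(clean_ref)
    return z_clean - z_wm


def looks_watermarked(query_score, wm_ref, clean_ref, threshold=0.0):
    """Flag the query as watermarked when the relative score exceeds a
    (hypothetical) threshold."""
    return relative_score(query_score, wm_ref, clean_ref) > threshold


# Toy reference scores; real scores would come from a proxy model.
wm_ref = [0.8, 0.9, 1.0, 1.1, 1.2]
clean_ref = [-0.2, -0.1, 0.0, 0.1, 0.2]

near_watermarked = looks_watermarked(0.95, wm_ref, clean_ref)
near_clean = looks_watermarked(0.05, wm_ref, clean_ref)
```

The point of the sketch is that the verdict comes from comparing the query against two reference distributions, not from an absolute, scheme-specific test statistic. A single measurement like this would be fragile in practice, which is presumably why the abstract describes a suite of complementary relative measurements.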


Key Contributions

  • First black-box watermark detection framework that works without access to watermarking keys or provider-side detectors
  • Decouples detection from injection by reframing verification as relative hypothesis testing
  • Demonstrates robustness across multiple watermarking schemes and attack scenarios

🛡️ Threat Analysis

Output Integrity Attack

Focuses on verifying output integrity and content provenance by detecting watermarks embedded in LLM-generated text — this is about authenticating and tracing model outputs, which is the core of ML09.


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time
Applications
llm output verification, content provenance, third-party auditing