defense · arXiv · Mar 16, 2026
Zhuoshang Wang, Yubing Ren, Yanan Cao et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +1 more
Black-box framework for third-party watermark detection in LLM outputs using proxy models and statistical tests
Output Integrity Attack nlp
While watermarking serves as a critical mechanism for LLM provenance, existing secret-key schemes tightly couple detection with injection, requiring access to the keys or to provider-side, scheme-specific detectors for verification. This dependency creates a fundamental barrier for real-world governance: independent auditing becomes impossible without compromising model security or relying on the opaque claims of service providers. To resolve this dilemma, we introduce TTP-Detect, a black-box framework designed for non-intrusive, third-party watermark verification. By decoupling detection from injection, TTP-Detect reframes verification as a relative hypothesis-testing problem: it employs a proxy model to amplify watermark-relevant signals and a suite of complementary relative measurements to assess how well the query text aligns with the watermarked distribution. Extensive experiments across representative watermarking schemes, datasets, and models demonstrate that TTP-Detect achieves superior detection performance and robustness against diverse attacks.
llm · Chinese Academy of Sciences · University of Chinese Academy of Sciences · National Computer Network Emergency Response Technical Team/Coordination Center of China
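The relative hypothesis-testing idea behind the abstract above can be illustrated with a toy sketch (the function name, proxy statistic, reference values, and threshold are all hypothetical assumptions for illustration, not the paper's actual method): a proxy model reduces each text to a scalar statistic, and the decision compares the query's statistic against a reference distribution estimated from known non-watermarked samples.

```python
import statistics

def relative_detection_score(query_stat, reference_stats):
    """Z-score of the query's proxy statistic relative to a reference
    distribution of non-watermarked texts (hypothetical sketch)."""
    mu = statistics.fmean(reference_stats)
    sigma = statistics.stdev(reference_stats)
    return (query_stat - mu) / sigma

# Toy proxy statistics measured on known non-watermarked texts.
reference = [0.48, 0.52, 0.50, 0.47, 0.53, 0.49, 0.51, 0.50]

# A watermark is assumed to shift the proxy statistic upward.
z = relative_detection_score(0.71, reference)
print(z > 3.0)  # flag as likely watermarked when z exceeds a chosen threshold
```

The appeal of a relative test in this black-box setting is that it needs no secret key: only the query text's statistic and a reference sample of plain text are required.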
defense · arXiv · Sep 19, 2025
Xiaowei Zhu, Yubing Ren, Fang Fang et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +2 more
Zero-shot AI text detector using DNA-inspired mutation-repair scoring to distinguish LLM-generated from human-written text at SOTA accuracy
Output Integrity Attack nlp
The rapid advancement of large language models (LLMs) has blurred the line between AI-generated and human-written text. This brings societal risks such as misinformation, authorship ambiguity, and intellectual-property concerns, underscoring the urgent need for reliable detection of AI-generated text. However, recent advances in generative language modeling have produced substantial overlap between the feature distributions of human-written and AI-generated text, blurring classification boundaries and making accurate detection increasingly difficult. To address these challenges, we propose a DNA-inspired perspective that uses a repair-based process to directly and interpretably capture the intrinsic differences between human-written and AI-generated text. Building on this perspective, we introduce DNA-DetectLLM, a zero-shot method for distinguishing AI-generated from human-written text. The method constructs an ideal AI-generated sequence for each input, iteratively repairs non-optimal tokens, and quantifies the cumulative repair effort as an interpretable detection signal. Empirical evaluations show state-of-the-art detection performance and strong robustness against diverse adversarial attacks and across varying input lengths, with relative improvements of 5.55% in AUROC and 2.08% in F1 score across multiple public benchmark datasets. Code and data are available at https://github.com/Xiaoweizhu57/DNA-DetectLLM.
llm · transformer · Chinese Academy of Sciences · University of Chinese Academy of Sciences · Macau University of Science and Technology +1 more
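The repair-effort score described in the abstract above can be sketched as follows (a minimal illustration with made-up per-token log-probabilities; the helper name and numbers are assumptions, not the authors' code): for each position, accumulate the gap between the model's most likely token and the token actually observed. Text decoded from an LLM tends to sit near the model's argmax and so needs little "repair", while human text deviates more often.

```python
def repair_effort(observed_logprobs, max_logprobs):
    """Cumulative repair effort: total log-prob gap between each observed
    token and the model's most likely token at that position (toy sketch)."""
    return sum(mx - obs for obs, mx in zip(observed_logprobs, max_logprobs))

# Toy per-token log-probs under some scoring model.
max_lp   = [-0.1, -0.2, -0.1, -0.3, -0.2]  # model's best token at each step
ai_lp    = [-0.1, -0.3, -0.1, -0.3, -0.4]  # AI text: close to the argmax
human_lp = [-1.2, -0.9, -2.0, -0.3, -1.5]  # human text: frequent deviations

# Lower repair effort -> more likely AI-generated.
print(repair_effort(ai_lp, max_lp) < repair_effort(human_lp, max_lp))  # True
```

Because the score is a sum of per-token gaps, it is interpretable position by position, which matches the zero-shot, explanation-friendly framing of the entry.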