CoCoNUTS: Concentrating on Content while Neglecting Uninformative Textual Styles for AI-Generated Peer Review Detection
Yihan Chen 1,2, Jiawei Chen 1,2, Guozhao Mo 1,2, Xuanang Chen 1, Ben He 1,2, Xianpei Han 1, Le Sun 1
Published on arXiv (arXiv:2509.04460)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
CoCoDet outperforms existing style-reliant AI-text detectors on peer reviews, and in particular remains robust when reviews are paraphrased to evade detection.
CoCoDet
Novel technique introduced
The growing integration of large language models (LLMs) into the peer review process poses risks to the fairness and reliability of scholarly evaluation. While LLMs offer reviewers valuable assistance with language refinement, there is growing concern over their use to generate substantive review content. Existing general AI-generated text detectors are vulnerable to paraphrasing attacks and struggle to distinguish surface-level language refinement from substantial content generation, suggesting that they rely primarily on stylistic cues. When applied to peer review, this limitation can result in unfairly suspecting reviews with permissible AI-assisted language enhancement while failing to catch deceptively humanized AI-generated reviews. To address this, we propose a paradigm shift from style-based to content-based detection. Specifically, we introduce CoCoNUTS, a content-oriented benchmark built upon a fine-grained dataset of AI-generated peer reviews covering six distinct modes of human-AI collaboration. Furthermore, we develop CoCoDet, an AI review detector built on a multi-task learning framework, designed to achieve more accurate and robust detection of AI involvement in review content. Our work offers a practical foundation for evaluating the use of LLMs in peer review, and contributes to the development of more precise, equitable, and reliable detection methods for real-world scholarly applications. Our code and data will be publicly available at https://github.com/Y1hanChen/COCONUTS.
Key Contributions
- CoCoNUTS: a fine-grained benchmark dataset covering six distinct modes of human-AI collaboration in peer review for evaluating AI-generated text detection
- CoCoDet: a multi-task learning detector that focuses on review content rather than stylistic cues, improving robustness against paraphrasing attacks
- Empirical demonstration that existing general AI-generated text detectors rely on stylistic features and fail to distinguish permissible language refinement from substantive AI content generation
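The multi-task design above can be pictured as a shared text encoder feeding two classification heads: one predicting which of the six human-AI collaboration modes produced the review, and one predicting whether substantive AI content is present. The sketch below is a minimal illustration of that idea, not the authors' implementation; the class name `CoCoDetSketch`, the `EmbeddingBag` stand-in for a transformer encoder, and the loss weighting `alpha` are all assumptions for demonstration.

```python
import torch
import torch.nn as nn

class CoCoDetSketch(nn.Module):
    """Hypothetical multi-task AI-review detector sketch:
    a shared encoder feeds (1) a 6-way head over human-AI
    collaboration modes and (2) a binary AI-involvement head."""

    def __init__(self, vocab_size=30522, dim=128, n_modes=6):
        super().__init__()
        # Stand-in for a pretrained transformer encoder.
        self.encoder = nn.EmbeddingBag(vocab_size, dim)
        self.mode_head = nn.Linear(dim, n_modes)  # which collaboration mode
        self.ai_head = nn.Linear(dim, 2)          # substantive AI content?

    def forward(self, token_ids, offsets):
        h = self.encoder(token_ids, offsets)      # (batch, dim)
        return self.mode_head(h), self.ai_head(h)

def multitask_loss(mode_logits, ai_logits, mode_y, ai_y, alpha=0.5):
    """Joint objective: weighted sum of the two cross-entropy losses."""
    ce = nn.functional.cross_entropy
    return alpha * ce(mode_logits, mode_y) + (1 - alpha) * ce(ai_logits, ai_y)
```

Training the two heads jointly is what pushes the shared encoder toward content features that both tasks need, rather than the stylistic cues a single binary head can shortcut on.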
🛡️ Threat Analysis
Proposes a novel AI-generated text detection system (CoCoDet) and evaluation benchmark (CoCoNUTS) targeting LLM-generated peer review content. The paper explicitly addresses paraphrasing attacks that defeat style-reliant detectors and introduces a content-focused multi-task learning framework to improve robustness — this is output integrity/AI-generated content detection, not a domain-only application.