CoCoNUTS: Concentrating on Content while Neglecting Uninformative Textual Styles for AI-Generated Peer Review Detection
Yihan Chen 1,2, Jiawei Chen 1,2, Guozhao Mo 1,2, Xuanang Chen 1, Ben He 1,2, Xianpei Han 1, Le Sun 1
Published on arXiv (arXiv:2509.04460)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
CoCoDet outperforms existing style-reliant AI-text detectors on peer reviews, and in particular remains robust when reviews are paraphrased to evade detection.
CoCoDet
Novel technique introduced
The growing integration of large language models (LLMs) into the peer review process poses risks to the fairness and reliability of scholarly evaluation. While LLMs offer reviewers valuable assistance with language refinement, there is growing concern over their use to generate substantive review content. Existing general AI-generated text detectors are vulnerable to paraphrasing attacks and struggle to distinguish surface-level language refinement from substantial content generation, suggesting that they rely primarily on stylistic cues. When applied to peer review, this limitation can result in unfairly suspecting reviews with permissible AI-assisted language enhancement while failing to catch deceptively humanized AI-generated reviews. To address this, we propose a paradigm shift from style-based to content-based detection. Specifically, we introduce CoCoNUTS, a content-oriented benchmark built upon a fine-grained dataset of AI-generated peer reviews covering six distinct modes of human-AI collaboration. Furthermore, we develop CoCoDet, an AI review detector built on a multi-task learning framework, designed to achieve more accurate and robust detection of AI involvement in review content. Our work offers a practical foundation for evaluating the use of LLMs in peer review, and contributes to the development of more precise, equitable, and reliable detection methods for real-world scholarly applications. Our code and data will be publicly available at https://github.com/Y1hanChen/COCONUTS.
Key Contributions
- CoCoNUTS: a fine-grained benchmark dataset covering six distinct modes of human-AI collaboration in peer review for evaluating AI-generated text detection
- CoCoDet: a multi-task learning detector that focuses on review content rather than stylistic cues, improving robustness against paraphrasing attacks
- Empirical demonstration that existing general AI-generated text detectors rely on stylistic features and fail to distinguish permissible language refinement from substantive AI content generation
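The multi-task design above can be pictured as a shared text encoder feeding two classification heads: one predicting which of the six human-AI collaboration modes produced the review, and one predicting whether substantive AI content is present. The sketch below is a minimal illustration of that idea, not the authors' implementation; the class name `CoCoDetSketch`, the `EmbeddingBag` stand-in for a transformer encoder, and the loss weighting `alpha` are all assumptions for demonstration.

```python
import torch
import torch.nn as nn

class CoCoDetSketch(nn.Module):
    """Hypothetical multi-task AI-review detector sketch:
    a shared encoder feeds (1) a 6-way head over human-AI
    collaboration modes and (2) a binary AI-involvement head."""

    def __init__(self, vocab_size=30522, dim=128, n_modes=6):
        super().__init__()
        # Stand-in for a pretrained transformer encoder.
        self.encoder = nn.EmbeddingBag(vocab_size, dim)
        self.mode_head = nn.Linear(dim, n_modes)  # which collaboration mode
        self.ai_head = nn.Linear(dim, 2)          # substantive AI content?

    def forward(self, token_ids, offsets):
        h = self.encoder(token_ids, offsets)      # (batch, dim)
        return self.mode_head(h), self.ai_head(h)

def multitask_loss(mode_logits, ai_logits, mode_y, ai_y, alpha=0.5):
    """Joint objective: weighted sum of the two cross-entropy losses."""
    ce = nn.functional.cross_entropy
    return alpha * ce(mode_logits, mode_y) + (1 - alpha) * ce(ai_logits, ai_y)
```

Training the two heads jointly is what pushes the shared encoder toward content features that both tasks need, rather than the stylistic cues a single binary head can shortcut on.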
🛡️ Threat Analysis
Proposes a novel AI-generated text detection system (CoCoDet) and evaluation benchmark (CoCoNUTS) targeting LLM-generated peer review content. The paper explicitly addresses paraphrasing attacks that defeat style-reliant detectors and introduces a content-focused multi-task learning framework to improve robustness — this is output integrity/AI-generated content detection, not a domain-only application.