Minimizing Mismatch Risk: A Prototype-Based Routing Framework for Zero-shot LLM-generated Text Detection
Ke Sun, Guangsheng Bao, Han Cui, Yue Zhang
Published on arXiv
2602.01240
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
DetectRouter achieves consistent improvements over fixed-surrogate baselines across all six detection criteria on EvoBench and MAGE, with relative gains up to 139.4%.
DetectRouter
Novel technique introduced
Zero-shot methods detect LLM-generated text by computing statistical signatures using a surrogate model. Existing approaches typically employ a fixed surrogate for all inputs regardless of the unknown source. We systematically examine this design and find that detection performance varies substantially depending on surrogate-source alignment. We observe that while no single surrogate achieves optimal performance universally, a well-matched surrogate typically exists within a diverse pool for any given input. This finding transforms robust detection into a routing problem: selecting the most appropriate surrogate for each input. We propose DetectRouter, a prototype-based framework that learns text-detector affinity through two-stage training. The first stage constructs discriminative prototypes from white-box models; the second generalizes to black-box sources by aligning geometric distances with observed detection scores. Experiments on EvoBench and MAGE benchmarks demonstrate consistent improvements across multiple detection criteria and model families.
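The routing step described above can be illustrated with a minimal sketch. It assumes each surrogate in the pool is represented by a single prototype vector and that an input text has already been embedded into the same space; the surrogate names, the 2-D vectors, and the use of Euclidean distance are all hypothetical choices for illustration, not details confirmed by the paper, and the two-stage training that would produce the prototypes is omitted.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def route(embedding, prototypes):
    """Return the name of the surrogate whose prototype is closest to the embedding.

    `prototypes` maps a (hypothetical) surrogate name to its prototype vector.
    """
    return min(prototypes, key=lambda name: euclidean(embedding, prototypes[name]))

# Toy 2-D prototypes; real prototypes would be learned, not hand-written.
prototypes = {
    "surrogate_a": [0.0, 0.0],
    "surrogate_b": [1.0, 1.0],
    "surrogate_c": [-1.0, 2.0],
}

text_embedding = [0.9, 1.2]  # hypothetical embedding of one input text
chosen = route(text_embedding, prototypes)
print(chosen)
```

In a full pipeline, the chosen surrogate would then be passed to a zero-shot detection criterion in place of a fixed surrogate.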
Key Contributions
- Identifies surrogate-source mismatch as a critical failure mode in zero-shot LLM-generated text detection and derives a KL-divergence-based performance bound
- Proposes DetectRouter, a prototype-based two-stage routing framework that learns text-detector affinity and selects the best-matched surrogate for each input
- Demonstrates consistent improvements over fixed-surrogate baselines on the EvoBench and MAGE benchmarks, with relative gains ranging from 5.4% to 139.4% across detection criteria
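To make the notion of surrogate-source mismatch in the first contribution concrete, the sketch below computes the KL divergence between a toy source distribution and two toy surrogate distributions over the same three-token vocabulary. This is only an illustration of how mismatch can be quantified; it is not the paper's performance bound, and the distributions and names are invented for the example. In practice the source distribution is unknown, which is why routing is needed.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions (hypothetical values).
source = [0.7, 0.2, 0.1]
surrogates = {
    "well_matched": [0.6, 0.3, 0.1],
    "mismatched": [0.1, 0.2, 0.7],
}

divergences = {name: kl_divergence(source, q) for name, q in surrogates.items()}
best = min(divergences, key=divergences.get)

for name, value in divergences.items():
    print(f"{name}: KL = {value:.4f}")
print("lowest divergence:", best)
```

The surrogate with the smaller divergence from the source is the better-aligned one in this toy setting.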
🛡️ Threat Analysis
The core contribution is AI-generated text detection: DetectRouter is a novel routing framework that sends each input to the best-matched surrogate for zero-shot LLM-generated text detection, directly addressing output integrity and content authenticity.