Minimizing Mismatch Risk: A Prototype-Based Routing Framework for Zero-shot LLM-generated Text Detection

Zero-shot methods detect LLM-generated text by computing statistical signatures using a surrogate model. Existing approaches typically employ a fixed surrogate for all inputs regardless of the unknown source. We systematically examine this design and find that detection performance varies substantially depending on surrogate-source alignment. We observe that while no single surrogate achieves optimal performance universally, a well-matched surrogate typically exists within a diverse pool for any given input. This finding transforms robust detection into a routing problem: selecting the most appropriate surrogate for each input. We propose DetectRouter, a prototype-based framework that learns text-detector affinity through two-stage training. The first stage constructs discriminative prototypes from white-box models; the second generalizes to black-box sources by aligning geometric distances with observed detection scores. Experiments on the EvoBench and MAGE benchmarks demonstrate consistent improvements across multiple detection criteria and model families.
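
The routing idea can be illustrated with a minimal sketch. It assumes each candidate surrogate is summarized by a single prototype vector and that an input is routed to the surrogate whose prototype is nearest under Euclidean distance. The names `route` and `detect`, the toy embedding function, the placeholder scorers, and the threshold are all hypothetical. The abstract does not specify the encoder, the distance, or the two-stage training procedure, and none of them is reproduced here.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def route(embedding, prototypes):
    """Pick the surrogate whose prototype lies closest to the input embedding."""
    return min(prototypes, key=lambda name: euclidean(embedding, prototypes[name]))

def detect(text, embed, prototypes, scorers, threshold):
    """Route the text to one surrogate, score it there, and apply a threshold."""
    surrogate = route(embed(text), prototypes)
    score = scorers[surrogate](text)
    return surrogate, score, score > threshold

# Toy stand-ins (assumptions for illustration, not the paper's components).
def toy_embed(text):
    # Two crude features: fraction of vowels and fraction of spaces.
    n = max(len(text), 1)
    return [sum(c in "aeiou" for c in text.lower()) / n, text.count(" ") / n]

toy_prototypes = {"surrogate_a": [0.5, 0.0], "surrogate_b": [0.2, 0.3]}
toy_scorers = {
    "surrogate_a": lambda t: 1.0,   # placeholder for a zero-shot statistic
    "surrogate_b": lambda t: -1.0,  # placeholder for a zero-shot statistic
}

if __name__ == "__main__":
    print(detect("aeiouaeiou", toy_embed, toy_prototypes, toy_scorers, threshold=0.0))
```

In a real system each scorer would compute a zero-shot detection statistic with its own surrogate model, and the prototypes would be learned rather than hand-set. The sketch only shows where the per-input selection step sits relative to scoring.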

Key Contributions

Identifies surrogate-source mismatch as a critical failure mode in zero-shot LLM-generated text detection and derives a KL-divergence-based performance bound
Proposes DetectRouter, a prototype-based two-stage routing framework that learns text-detector affinity and dynamically selects the optimal surrogate per input
Demonstrates consistent state-of-the-art improvements on EvoBench and MAGE benchmarks, with relative gains ranging from 5.4% to 139.4% across detection criteria
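
The gains above are reported as relative, not absolute, improvements. As a reminder of how such a figure is usually computed, here is a small sketch. The function name `relative_gain` and the example scores are made up for illustration and are not results from the paper.

```python
def relative_gain(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (improved - baseline) / baseline * 100.0

# Made-up scores for illustration only; not results from the paper.
print(f"{relative_gain(0.50, 0.75):.1f}%")
```

With these illustrative numbers, an absolute increase of 0.25 on a baseline of 0.50 corresponds to a relative gain of 50%, which is why relative gains can exceed 100% when the baseline score is low.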

🛡️ Threat Analysis

Output Integrity Attack

Core contribution is AI-generated text detection — a novel architecture (DetectRouter) that routes inputs to the best-matched surrogate for zero-shot LLM-generated text detection, directly addressing output integrity and content authenticity.

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

black_box, inference_time

Datasets

EvoBench, MAGE

Applications

2025, 0 citations
