defense 2026

Minimizing Mismatch Risk: A Prototype-Based Routing Framework for Zero-shot LLM-generated Text Detection

Ke Sun , Guangsheng Bao , Han Cui , Yue Zhang

0 citations · 42 references · arXiv (Cornell University)

Published on arXiv

2602.01240

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

DetectRouter achieves consistent improvements over fixed-surrogate baselines across all six detection criteria on EvoBench and MAGE, with relative gains up to 139.4%.

DetectRouter

Novel technique introduced


Zero-shot methods detect LLM-generated text by computing statistical signatures using a surrogate model. Existing approaches typically employ a fixed surrogate for all inputs regardless of the unknown source. We systematically examine this design and find that detection performance varies substantially depending on surrogate-source alignment. We observe that while no single surrogate achieves optimal performance universally, a well-matched surrogate typically exists within a diverse pool for any given input. This finding transforms robust detection into a routing problem: selecting the most appropriate surrogate for each input. We propose DetectRouter, a prototype-based framework that learns text-detector affinity through two-stage training. The first stage constructs discriminative prototypes from white-box models; the second generalizes to black-box sources by aligning geometric distances with observed detection scores. Experiments on EvoBench and MAGE benchmarks demonstrate consistent improvements across multiple detection criteria and model families.
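The routing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the surrogate names, the 2-D prototype vectors, and the nearest-prototype rule under Euclidean distance are all assumptions made for illustration, and the two-stage training that would produce real prototypes and embeddings is omitted.

```python
import math
from typing import Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def route(embedding: Sequence[float],
          prototypes: Dict[str, List[Sequence[float]]]) -> str:
    """Return the surrogate whose closest prototype is nearest to the embedding."""
    best_name, best_dist = None, math.inf
    for name, protos in prototypes.items():
        # Each surrogate may own several prototypes; keep its closest one.
        dist = min(euclidean(embedding, p) for p in protos)
        if dist < best_dist:
            best_name, best_dist = name, dist
    if best_name is None:
        raise ValueError("prototype pool is empty")
    return best_name


# Hypothetical prototypes for two placeholder surrogates (not from the paper).
PROTOTYPES = {
    "surrogate_a": [(0.0, 0.0), (1.0, 0.0)],
    "surrogate_b": [(5.0, 5.0)],
}

# A hypothetical text embedding; a real system would compute it with an encoder.
chosen = route((4.0, 4.5), PROTOTYPES)
print(chosen)
```

In this toy setup the selected surrogate would then be the one used to compute the zero-shot detection score for that input, instead of a single fixed surrogate for every input.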


Key Contributions

  • Identifies surrogate-source mismatch as a critical failure mode in zero-shot LLM-generated text detection and derives a KL-divergence-based performance bound
  • Proposes DetectRouter, a prototype-based two-stage routing framework that learns text-detector affinity and dynamically selects the optimal surrogate per input
  • Demonstrates consistent state-of-the-art improvements on EvoBench and MAGE benchmarks, with relative gains ranging from 5.4% to 139.4% across detection criteria
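For readers interpreting the relative gains quoted above, the helper below shows the usual definition of a relative gain over a baseline. This page does not state the exact formula the paper uses, so the definition is an assumption, and the scores in the example are hypothetical rather than reported results.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Percentage improvement of `improved` over `baseline` (assumed definition)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical scores, chosen only to illustrate the arithmetic.
gain = relative_gain(0.40, 0.50)
print(f"{gain:.1f}%")
```

Under this definition, a gain above 100% means the routed score is more than double the baseline score on that criterion.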

🛡️ Threat Analysis

Output Integrity Attack

The core contribution is AI-generated text detection: DetectRouter is a novel architecture that routes each input to the best-matched surrogate for zero-shot LLM-generated text detection, directly addressing output integrity and content authenticity.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
EvoBench, MAGE
Applications
llm-generated text detection, ai content authenticity verification