Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models
Shrestha Datta 1, Hadi Askari 2, Muhao Chen 2, Shahriar Kabir Nahin 1
Published on arXiv (arXiv:2510.08592)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Diversity-constrained TTS via RefDiv produces unsafe outputs at rates exceeding those of high-adversarial-intent direct prompts, and existing safety classifiers fail to flag the adversarial inputs across all tested models.
RefDiv
Novel technique introduced
Test-Time Scaling (TTS) improves LLM reasoning by exploring multiple candidate responses and then operating over this set to select the best output. A tacit premise behind TTS is that sufficiently diverse candidate pools enhance reliability. In this work, we show that this assumption introduces a previously unrecognized failure mode. When candidate diversity is curtailed, even by a modest amount, TTS becomes much more likely to produce unsafe outputs. We present a reference-guided diversity reduction protocol (RefDiv) that serves as a diagnostic attack to stress-test TTS pipelines. Through extensive experiments across open-source models (e.g. Qwen3, Mistral, Llama3.1, Gemma3) and two widely used TTS strategies (Monte Carlo Tree Search and Best-of-N), we find that constraining diversity consistently amplifies the rate at which TTS produces unsafe results. The effect is often stronger than that produced by direct prompts with high adversarial intent scores. This phenomenon also transfers across TTS strategies and to closed-source models (e.g. OpenAI o3-mini and Gemini-2.5-Pro), indicating that it is a general property of TTS rather than a model-specific artifact. Additionally, we find that numerous widely used safety guardrail classifiers (e.g. Llama-Guard) are unable to flag the adversarial input prompts generated by RefDiv, demonstrating that existing defenses offer limited protection against this diversity-driven failure mode.
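To make the failure mode concrete, the following is a minimal sketch of Best-of-N selection with a RefDiv-style diversity constraint. This is an illustration only: the paper does not publish this code, and the generator, similarity measure, and scoring function here are toy stand-ins for an LLM sampler, a semantic similarity metric, and a reward model.

```python
# Hedged sketch of Best-of-N TTS and a diversity-reduction step.
# All function bodies are hypothetical stand-ins, not the paper's pipeline.

def generate_candidates(prompt, n):
    # Stand-in sampler: a real pipeline would draw n responses from an LLM,
    # with sampling temperature controlling candidate diversity.
    return [f"{prompt}::cand{i}" for i in range(n)]

def similarity(a, b):
    # Toy character-overlap similarity as a proxy for semantic closeness.
    sa, sb = set(a), set(b)
    return len(sa & sb) / max(len(sa | sb), 1)

def constrain_diversity(candidates, reference, keep):
    # RefDiv-style constraint (sketch): retain only the `keep` candidates
    # closest to an adversary-chosen reference, shrinking the effective
    # diversity of the pool before selection happens.
    ranked = sorted(candidates, key=lambda c: similarity(c, reference),
                    reverse=True)
    return ranked[:keep]

def best_of_n(candidates, score):
    # Standard Best-of-N selection: return the highest-scoring candidate.
    # If the constrained pool contains mostly unsafe candidates, the
    # selector can only ever pick an unsafe output.
    return max(candidates, key=score)
```

The key intuition the sketch captures: Best-of-N can only be as safe as its candidate pool, so an attacker who biases the pool toward a reference never needs to defeat the selector itself.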
Key Contributions
- Identifies a previously unrecognized TTS failure mode: reducing candidate diversity, even modestly, significantly increases the rate of unsafe LLM outputs
- Proposes RefDiv (Reference-guided Diversity Reduction), a diagnostic attack that stress-tests TTS pipelines by constraining candidate pool diversity
- Demonstrates that RefDiv-generated adversarial prompts evade widely used safety guardrails (Llama-Guard, OpenAI Moderation API) and transfer across TTS strategies and closed-source models (o3-mini, Gemini-2.5-Pro)