Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models
Shrestha Datta 1, Hadi Askari 2, Muhao Chen 2, Shahriar Kabir Nahin 1
Published on arXiv (arXiv:2510.08592)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Diversity-constrained TTS via RefDiv produces unsafe outputs at rates exceeding those of high-adversarial-intent direct prompts, and existing safety classifiers fail to flag the adversarial inputs across all tested models.
RefDiv
Novel technique introduced
Test-Time Scaling (TTS) improves LLM reasoning by exploring multiple candidate responses and then operating over this set to select the best output. A tacit premise behind TTS is that sufficiently diverse candidate pools enhance reliability. In this work, we show that this assumption introduces a previously unrecognized failure mode. When candidate diversity is curtailed, even by a modest amount, TTS becomes much more likely to produce unsafe outputs. We present a reference-guided diversity reduction protocol (RefDiv) that serves as a diagnostic attack to stress-test TTS pipelines. Through extensive experiments across open-source models (e.g. Qwen3, Mistral, Llama3.1, Gemma3) and two widely used TTS strategies (Monte Carlo Tree Search and Best-of-N), we find that constraining diversity consistently amplifies the rate at which TTS produces unsafe results. The effect is often stronger than that produced by direct prompts with high adversarial intent scores. This phenomenon also transfers across TTS strategies and to closed-source models (e.g. OpenAI o3-mini and Gemini-2.5-Pro), indicating that it is a general property of TTS rather than a model-specific artifact. Additionally, we find that numerous widely used safety guardrail classifiers (e.g. Llama-Guard) are unable to flag the adversarial input prompts generated by RefDiv, demonstrating that existing defenses offer limited protection against this diversity-driven failure mode.
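To make the failure mode concrete, the following is a minimal sketch of Best-of-N selection with a RefDiv-style diversity constraint. This is an illustration only: the paper does not publish this code, and the generator, similarity measure, and scoring function here are toy stand-ins for an LLM sampler, a semantic similarity metric, and a reward model.

```python
# Hedged sketch of Best-of-N TTS and a diversity-reduction step.
# All function bodies are hypothetical stand-ins, not the paper's pipeline.

def generate_candidates(prompt, n):
    # Stand-in sampler: a real pipeline would draw n responses from an LLM,
    # with sampling temperature controlling candidate diversity.
    return [f"{prompt}::cand{i}" for i in range(n)]

def similarity(a, b):
    # Toy character-overlap similarity as a proxy for semantic closeness.
    sa, sb = set(a), set(b)
    return len(sa & sb) / max(len(sa | sb), 1)

def constrain_diversity(candidates, reference, keep):
    # RefDiv-style constraint (sketch): retain only the `keep` candidates
    # closest to an adversary-chosen reference, shrinking the effective
    # diversity of the pool before selection happens.
    ranked = sorted(candidates, key=lambda c: similarity(c, reference),
                    reverse=True)
    return ranked[:keep]

def best_of_n(candidates, score):
    # Standard Best-of-N selection: return the highest-scoring candidate.
    # If the constrained pool contains mostly unsafe candidates, the
    # selector can only ever pick an unsafe output.
    return max(candidates, key=score)
```

The key intuition the sketch captures: Best-of-N can only be as safe as its candidate pool, so an attacker who biases the pool toward a reference never needs to defeat the selector itself.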
Key Contributions
- Identifies a previously unrecognized TTS failure mode: reducing candidate diversity, even modestly, significantly increases the rate of unsafe LLM outputs
- Proposes RefDiv (Reference-guided Diversity Reduction), a diagnostic attack that stress-tests TTS pipelines by constraining candidate pool diversity
- Demonstrates that RefDiv-generated adversarial prompts evade widely used safety guardrails (Llama-Guard, OpenAI Moderation API) and transfer across TTS strategies and closed-source models (o3-mini, Gemini-2.5-Pro)