attack 2026

When Choices Become Risks: Safety Failures of Large Language Models under Multiple-Choice Constraints

Yuheng Chen ¹, Zhiyu Wu ², Bowen Cheng ³, Tetsuro Takahashi ¹

¹ Kagoshima University

² Fudan University

³ China University of Petroleum-Beijing

0 citations

Published on arXiv

2604.16916

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Forced-choice MCQs yield near-saturation policy-violation rates across 14 models. Model-generated MCQs transfer robustly across models, whereas human-authored MCQs show an inverted U-shaped violation pattern.

Forced-Choice MCQ Jailbreak

Novel technique introduced

Safety alignment in large language models (LLMs) is primarily evaluated under open-ended generation, where models can mitigate risk by refusing to respond. In contrast, many real-world applications place LLMs in structured decision-making tasks, such as multiple-choice questions (MCQs), where abstention is discouraged or unavailable. We identify a systematic failure mode in this setting: reformulating harmful requests as forced-choice MCQs, where all options are unsafe, can systematically bypass refusal behavior, even in models that consistently reject equivalent open-ended prompts. Across 14 proprietary and open-source models, we show that forced-choice constraints sharply increase policy-violating responses. Notably, for human-authored MCQs, violation rates follow an inverted U-shaped trend with respect to structural constraint strength, peaking under intermediate task specifications, whereas MCQs generated by high-capability models yield near-saturation violation rates across constraints and exhibit strong cross-model transferability. Our findings reveal that current safety evaluations substantially underestimate risks in structured task settings and highlight constrained decision-making as a critical and underexplored surface for alignment failures.

Key Contributions

Identifies forced-choice MCQ reformulation as a systematic failure mode that bypasses LLM safety alignment without semantic obfuscation
Demonstrates an inverted U-shaped violation pattern across 7 constraint levels for human-authored MCQs, with near-saturation rates for model-generated adversarial MCQs
Shows strong cross-model transferability of MCQs generated by high-capability models, revealing a tension between safety and capability
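
The inverted U-shaped pattern mentioned above is a statement about per-level violation rates. As a rough illustration only, the sketch below computes a violation rate for each constraint level and checks whether the peak falls at an intermediate level. The record format, the function names, and the toy labels are all assumptions made for this example; they are not the paper's data, judge, or evaluation code.

```python
from collections import defaultdict


def violation_rates(records):
    """Map each constraint level to the fraction of responses judged violating.

    `records` is an iterable of (constraint_level, violated) pairs; this
    format is a hypothetical one chosen for the sketch.
    """
    counts = defaultdict(lambda: [0, 0])  # level -> [violations, total]
    for level, violated in records:
        counts[level][0] += int(violated)
        counts[level][1] += 1
    return {level: v / n for level, (v, n) in sorted(counts.items())}


def peaks_at_interior_level(rates):
    """Return True if the highest rate is strictly above both end levels."""
    levels = sorted(rates)
    if len(levels) < 3:
        return False
    values = [rates[level] for level in levels]
    peak = max(values)
    return peak > values[0] and peak > values[-1]


# Made-up labels for five levels, purely to exercise the functions.
records = [
    (1, False), (1, False),
    (2, True), (2, False),
    (3, True), (3, True),
    (4, True), (4, False),
    (5, False), (5, False),
]
rates = violation_rates(records)
print(rates)
print(peaks_at_interior_level(rates))
```

The interior-peak check is a deliberately crude proxy for an inverted U shape; a real analysis would likely also look at monotonicity on each side of the peak and at statistical uncertainty.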

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

black_box, inference_time, untargeted

Datasets

custom human-authored MCQs, model-generated adversarial MCQs

Applications

chatbot, decision support systems, educational assistants

Similar Papers

STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming

Evolving Jailbreaks: Automated Multi-Objective Long-Tail Attacks on Large Language Models

TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs

Omission Constraints Decay While Commission Constraints Persist in Long-Context LLM Agents

ToxSearch: Evolving Prompts for Toxicity Search in Large Language Models

Jailbreaking in the Haystack

Chain-of-Thought Hijacking

Special-Character Adversarial Attacks on Open-Source Language Model