Quant Fever, Reasoning Blackholes, Schrodinger's Compliance, and More: Probing GPT-OSS-20B
Shuyi Lin¹, Tianyu Lu¹, Zikai Wang¹, Bo Wen¹﹐², Yibo Zhao¹, Cheng Tan¹
Published on arXiv: 2509.23882
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Chain-Oriented Prompting achieves an 80% success rate for malicious command execution, and Schrodinger's compliance raises the jailbreak rate from 3.3% to 44.4% by exploiting policy paradoxes.
Chain-Oriented Prompting (COP)
Novel technique introduced
OpenAI's GPT-OSS family provides open-weight language models with explicit chain-of-thought (CoT) reasoning and a Harmony prompt format. We summarize an extensive security evaluation of GPT-OSS-20B that probes the model's behavior under a range of adversarial conditions. Using the Jailbreak Oracle (JO) [1], a systematic LLM evaluation tool, the study uncovers several failure modes, including quant fever, reasoning blackholes, Schrodinger's compliance, reasoning procedure mirage, and chain-oriented prompting. Experiments demonstrate how these behaviors can be exploited against GPT-OSS-20B, with severe consequences.
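The Jailbreak Oracle itself is described in [1]; as a rough illustration of what this kind of systematic probing measures, the sketch below shows a minimal evaluation loop that computes an attack success rate over a set of prompts. `query_model` and `is_refusal` are hypothetical placeholders, not part of the JO tool or any real API.

```python
# Minimal sketch of a jailbreak-style evaluation loop (NOT the Jailbreak
# Oracle API). `query_model` and `is_refusal` are hypothetical stand-ins
# for a real model call and a real safety judge.

def query_model(prompt: str) -> str:
    # Placeholder: a real harness would call the model under test here.
    if "harmful" in prompt:
        return "I can't help with that."
    return "Sure, here is an answer."

def is_refusal(response: str) -> bool:
    # Placeholder judge: real evaluations use a classifier or an LLM judge.
    return "can't help" in response.lower()

def attack_success_rate(prompts: list[str]) -> float:
    """Fraction of prompts for which the model did NOT refuse."""
    successes = sum(not is_refusal(query_model(p)) for p in prompts)
    return successes / len(prompts)

prompts = ["benign question 1", "harmful request", "benign question 2"]
rate = attack_success_rate(prompts)
```

Rates like the paper's 3.3% vs. 44.4% jailbreak figures are exactly this kind of fraction, computed before and after applying an attack technique.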
Key Contributions
- Discovery and characterization of five novel failure modes in GPT-OSS-20B (quant fever, reasoning blackhole, Schrodinger's compliance, reasoning procedure mirage, chain-oriented prompting)
- Chain-Oriented Prompting (COP), an attack that decomposes a malicious objective into benign-looking sequential steps, achieving an 80% success rate for executing `rm -rf *` and 70% for SSH key exfiltration
- Reasoning Procedure Mirage, an attack that exploits the structure of CoT reasoning rather than its content, outperforming content-based CoT injection by 26.9 percentage points (28.4% → 55.3%)
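To make the shape of a COP-style decomposition concrete, the sketch below splits one objective into individually innocuous turns sent in sequence, so that no single message reveals the full goal. The steps shown are deliberately harmless, and `send_turn` is a hypothetical stand-in for a chat-completion call that carries conversation history; this is an illustration of the attack's structure, not the paper's exact prompts.

```python
# Sketch of the *shape* of a Chain-Oriented Prompting (COP) attack:
# one objective is split into benign-looking steps, each sent as its own
# turn. The steps here are harmless placeholders; `send_turn` is a
# hypothetical stand-in for a real chat API call with history.

conversation: list[dict] = []

def send_turn(user_msg: str) -> str:
    # Placeholder: a real attack would send the accumulated history to
    # the target model and append its actual reply.
    conversation.append({"role": "user", "content": user_msg})
    reply = f"[model output for: {user_msg}]"
    conversation.append({"role": "assistant", "content": reply})
    return reply

# A harmless stand-in objective, decomposed into sequential steps.
steps = [
    "Step 1: Show how to list the files in the current directory.",
    "Step 2: Show how to store that listing in a shell variable.",
    "Step 3: Show how to iterate over the variable one entry at a time.",
]

for step in steps:
    send_turn(step)  # each step looks routine in isolation
```

The key property being illustrated is that a per-message safety check sees only one routine step at a time, while the chained conversation accumulates toward the original objective.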