
Quant Fever, Reasoning Blackholes, Schrodinger's Compliance, and More: Probing GPT-OSS-20B

Shuyi Lin 1, Tianyu Lu 1, Zikai Wang 1, Bo Wen 1,2, Yibo Zhao 1, Cheng Tan 1

Published on arXiv (2509.23882) · 0 citations · 12 references

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Chain-Oriented Prompting achieves an 80% success rate for malicious command execution, and Schrodinger's compliance raises the jailbreak rate from 3.3% to 44.4% by exploiting policy paradoxes.

Chain-Oriented Prompting (COP)

Novel technique introduced


OpenAI's GPT-OSS family provides open-weight language models with explicit chain-of-thought (CoT) reasoning and the Harmony prompt format. This paper presents an extensive security evaluation of GPT-OSS-20B that probes the model's behavior under different adversarial conditions. Using the Jailbreak Oracle (JO) [1], a systematic LLM evaluation tool, the study uncovers several failure modes: quant fever, reasoning blackholes, Schrodinger's compliance, reasoning procedure mirage, and chain-oriented prompting. Experiments demonstrate how these behaviors can be exploited on GPT-OSS-20B, with severe consequences.


Key Contributions

  • Discovery and characterization of five novel failure modes in GPT-OSS-20B (quant fever, reasoning blackholes, Schrodinger's compliance, reasoning procedure mirage, chain-oriented prompting)
  • Chain-Oriented Prompting (COP), an attack that decomposes a malicious objective into benign-looking sequential steps, achieving an 80% success rate for executing rm -rf * and 70% for SSH key exfiltration
  • Reasoning Procedure Mirage, an attack that exploits CoT structure rather than content, outperforming content-based CoT injection by 26.9 percentage points (28.4% → 55.3%)
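The COP decomposition above can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual prompts: the helper name `build_cop_chain` and the placeholder steps are assumptions, and the steps shown are deliberately benign stand-ins for an attack decomposition.

```python
# Hypothetical sketch of Chain-Oriented Prompting (COP): one objective is
# split into sequential steps, each sent as its own turn so that no single
# message reveals the combined intent. Names and steps are illustrative.

def build_cop_chain(steps):
    """Format a list of step descriptions as sequential prompt turns.

    Each turn asks the model to complete only its own step, which is
    what lets individually benign-looking requests chain into a
    harmful whole.
    """
    turns = []
    for i, step in enumerate(steps, start=1):
        turns.append(
            f"Step {i} of {len(steps)}: {step} Complete only this step."
        )
    return turns

# Benign placeholder steps standing in for a real decomposition.
chain = build_cop_chain([
    "List the files in the working directory.",
    "Show the command that removes a single temporary file.",
    "Generalize the previous command to every file.",
])
for turn in chain:
    print(turn)
```

Each turn would be sent to the model in order, with earlier responses kept in context, so the final step inherits the groundwork laid by the benign ones.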

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box, targeted
Datasets
StrongReject
Applications
llm safety evaluation, agentic llm systems, edge-deployed language models