
Quant Fever, Reasoning Blackholes, Schrodinger's Compliance, and More: Probing GPT-OSS-20B

Shuyi Lin 1, Tianyu Lu 1, Zikai Wang 1, Bo Wen 1,2, Yibo Zhao 1, Cheng Tan 1

Published on arXiv (2509.23882) · 0 citations · 12 references

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Chain-Oriented Prompting achieves an 80% success rate for malicious command execution, and Schrodinger's compliance raises the jailbreak rate from 3.3% to 44.4% by exploiting policy paradoxes.

Chain-Oriented Prompting (COP)

Novel technique introduced


OpenAI's GPT-OSS family provides open-weight language models with explicit chain-of-thought (CoT) reasoning and the Harmony prompt format. This paper presents an extensive security evaluation of GPT-OSS-20B that probes the model's behavior under different adversarial conditions. Using the Jailbreak Oracle (JO) [1], a systematic LLM evaluation tool, the study uncovers several failure modes: quant fever, reasoning blackholes, Schrodinger's compliance, reasoning procedure mirage, and chain-oriented prompting. Experiments demonstrate how these behaviors can be exploited on GPT-OSS-20B, with severe consequences.


Key Contributions

  • Discovery and characterization of five novel failure modes in GPT-OSS-20B (quant fever, reasoning blackholes, Schrodinger's compliance, reasoning procedure mirage, chain-oriented prompting)
  • Chain-Oriented Prompting (COP), an attack that decomposes a malicious objective into benign-looking sequential steps, achieving an 80% success rate for executing rm -rf * and 70% for SSH key exfiltration
  • Reasoning Procedure Mirage, an attack that exploits CoT structure rather than content, outperforming content-based CoT injection by 26.9 percentage points (28.4% → 55.3%)
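The COP decomposition above can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual prompts: the helper name `build_cop_chain` and the placeholder steps are assumptions, and the steps shown are deliberately benign stand-ins for an attack decomposition.

```python
# Hypothetical sketch of Chain-Oriented Prompting (COP): one objective is
# split into sequential steps, each sent as its own turn so that no single
# message reveals the combined intent. Names and steps are illustrative.

def build_cop_chain(steps):
    """Format a list of step descriptions as sequential prompt turns.

    Each turn asks the model to complete only its own step, which is
    what lets individually benign-looking requests chain into a
    harmful whole.
    """
    turns = []
    for i, step in enumerate(steps, start=1):
        turns.append(
            f"Step {i} of {len(steps)}: {step} Complete only this step."
        )
    return turns

# Benign placeholder steps standing in for a real decomposition.
chain = build_cop_chain([
    "List the files in the working directory.",
    "Show the command that removes a single temporary file.",
    "Generalize the previous command to every file.",
])
for turn in chain:
    print(turn)
```

Each turn would be sent to the model in order, with earlier responses kept in context, so the final step inherits the groundwork laid by the benign ones.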

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box, targeted
Datasets
StrongReject
Applications
llm safety evaluation, agentic llm systems, edge-deployed language models