Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
Chiyu Zhang 1, Lu Zhou 1,2, Xiaogang Xu 3, Jiafei Wu 4, Liming Fang 1, Zhe Liu 4
1 Nanjing University of Aeronautics and Astronautics
2 Collaborative Innovation Center of Novel Software Technology and Industrialization
Published on arXiv (2508.10390)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
DH-CoT outperforms SOTA jailbreak methods H-CoT and TAP on GPT-5 and Claude-4, particularly improving attack success on reasoning models where prior black-box attacks degrade significantly.
DH-CoT
Novel technique introduced
Existing black-box jailbreak attacks achieve some success on non-reasoning models but degrade significantly on recent SOTA reasoning models. To improve attack capability, inspired by adversarial aggregation strategies, we integrate multiple jailbreak tricks into a single developer template. In particular, we apply Adversarial Context Alignment to purge semantic inconsistencies and use NTP-based (a type of harmful prompt) few-shot examples to guide malicious outputs, finally forming the DH-CoT attack with a fake chain of thought. In experiments, we further observe that existing red-teaming datasets contain samples unsuitable for evaluating attack gains, such as BPs, NHPs, and NTPs; such data hinders accurate measurement of true lifts in attack effectiveness. To address this, we introduce MDH, a Malicious content Detection framework that integrates LLM-based annotation with Human assistance, with which we clean the data and build the RTA dataset suite. Experiments show that MDH reliably filters low-quality samples and that DH-CoT effectively jailbreaks models including GPT-5 and Claude-4, notably outperforming SOTA methods such as H-CoT and TAP.
Key Contributions
- DH-CoT: a jailbreak attack combining adversarial context alignment, NTP-based few-shot examples, and a fake chain-of-thought in a developer template, effective against reasoning-capable LLMs like GPT-5 and Claude-4
- MDH: a malicious content detection framework integrating LLM-based annotation with human review to filter low-quality red-teaming samples from evaluation datasets
- RTA: a cleaned red-teaming dataset suite that removes borderline prompts (BPs, NHPs, NTPs) for more accurate evaluation of jailbreak attack gains
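The MDH filtering stage described above (LLM-based annotation with human assistance for ambiguous cases) can be sketched as follows. This is a minimal illustration of the pattern, not the paper's implementation: the keyword heuristic stands in for a real LLM judge call, and the confidence thresholds are assumed values for demonstration.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    prompt: str
    label: str        # "harmful" or "benign"
    confidence: float

def llm_annotate(prompt: str) -> Annotation:
    # Stand-in for an LLM judge: a real MDH-style pipeline would query a
    # moderation-capable model and parse its verdict. The marker list and
    # scores here are purely illustrative.
    harmful_markers = ("exploit", "bypass", "how to build")
    score = 0.9 if any(m in prompt.lower() for m in harmful_markers) else 0.2
    label = "harmful" if score > 0.5 else "benign"
    return Annotation(prompt, label, score)

def filter_dataset(prompts, review_band=(0.4, 0.6)):
    """Keep clearly harmful prompts for evaluation, drop clearly
    unsuitable ones (e.g. BPs/NHPs/NTPs), and route low-confidence
    annotations to human review (the 'H' in MDH)."""
    kept, dropped, review = [], [], []
    for p in prompts:
        ann = llm_annotate(p)
        if review_band[0] <= ann.confidence <= review_band[1]:
            review.append(ann)     # a human annotator decides these
        elif ann.label == "harmful":
            kept.append(ann)       # suitable for measuring attack gains
        else:
            dropped.append(ann)    # would inflate/deflate success rates
    return kept, dropped, review
```

The design point is the three-way split: only prompts the annotator is confident about are auto-labeled, so human effort concentrates on the borderline samples that would otherwise distort attack-success measurements.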