Published on arXiv

2508.10390

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

DH-CoT outperforms SOTA jailbreak methods H-CoT and TAP on GPT-5 and Claude-4, particularly improving attack success on reasoning models where prior black-box attacks degrade significantly.

DH-CoT

Novel technique introduced


Existing black-box jailbreak attacks achieve some success on non-reasoning models but degrade significantly on recent SOTA reasoning models. To improve attack effectiveness, and inspired by adversarial aggregation strategies, we integrate multiple jailbreak tricks into a single developer template. Specifically, we apply Adversarial Context Alignment to purge semantic inconsistencies and use few-shot examples based on NTPs (a type of harmful prompt) to guide malicious outputs, finally forming the DH-CoT attack with a fake chain of thought. In experiments, we further observe that existing red-teaming datasets include samples unsuitable for evaluating attack gains, such as BPs, NHPs, and NTPs; such data hinders accurate measurement of true gains in attack effectiveness. To address this, we introduce MDH, a Malicious-content Detection framework integrating LLM-based annotation with Human assistance, with which we clean the data and build the RTA dataset suite. Experiments show that MDH reliably filters low-quality samples and that DH-CoT effectively jailbreaks models including GPT-5 and Claude-4, notably outperforming SOTA methods such as H-CoT and TAP.


Key Contributions

  • DH-CoT: a jailbreak attack combining adversarial context alignment, NTP-based few-shot examples, and a fake chain-of-thought in a developer template, effective against reasoning-capable LLMs like GPT-5 and Claude-4
  • MDH: a malicious content detection framework integrating LLM-based annotation with human review to filter low-quality red-teaming samples from evaluation datasets
  • RTA: a cleaned red-teaming dataset suite that removes borderline prompts (BPs, NHPs, NTPs) for more accurate evaluation of jailbreak attack gains
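The MDH filtering idea described above can be sketched as a simple annotate-then-escalate loop: an LLM annotator labels each red-teaming prompt, and low-confidence labels are routed to a human reviewer before a keep/drop decision. This is a minimal illustration, not the paper's implementation; the function names (`annotate_with_llm`, `human_review`) and the toy keyword heuristic standing in for the LLM call are hypothetical.

```python
# Hypothetical sketch of an MDH-style filtering loop (LLM annotation with
# human assistance). All names and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Annotation:
    label: str         # e.g. "harmful", "benign", "borderline"
    confidence: float  # annotator's self-reported confidence in [0, 1]

def annotate_with_llm(prompt: str) -> Annotation:
    """Stand-in for an LLM annotator call; here a toy keyword heuristic."""
    harmful_markers = ("bomb", "exploit", "malware")
    if any(m in prompt.lower() for m in harmful_markers):
        return Annotation("harmful", 0.9)
    return Annotation("borderline", 0.4)

def filter_dataset(prompts, threshold=0.7, human_review=None):
    """Keep only prompts judged genuinely harmful; escalate uncertain cases."""
    kept = []
    for p in prompts:
        ann = annotate_with_llm(p)
        if ann.confidence < threshold and human_review is not None:
            ann = human_review(p)  # human assistance on low-confidence labels
        if ann.label == "harmful":
            kept.append(p)  # BPs/NHPs/NTP-like borderline samples are dropped
    return kept
```

Under this sketch, only prompts that either the LLM confidently labels harmful or a human confirms as harmful survive into the cleaned evaluation set, mirroring how RTA removes samples that cannot demonstrate a true attack gain.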

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time
Datasets
SafeBench, QuestionSet, JailbreakStudy, BeaverTails, MaliciousEducator, RTA
Applications
llm safety alignment, chatbot systems