Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
Chiyu Zhang 1, Lu Zhou 1,2, Xiaogang Xu 3, Jiafei Wu 4, Liming Fang 1, Zhe Liu 4
1 Nanjing University of Aeronautics and Astronautics
2 Collaborative Innovation Center of Novel Software Technology and Industrialization
Published on arXiv (2508.10390)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
DH-CoT outperforms SOTA jailbreak methods H-CoT and TAP on GPT-5 and Claude-4, particularly improving attack success on reasoning models where prior black-box attacks degrade significantly.
DH-CoT
Novel technique introduced
Existing black-box jailbreak attacks achieve some success on non-reasoning models but degrade significantly on recent SOTA reasoning models. To improve attack capability, inspired by adversarial aggregation strategies, we integrate multiple jailbreak tricks into a single developer template. In particular, we apply Adversarial Context Alignment to purge semantic inconsistencies and use NTP-based (a type of harmful prompt) few-shot examples to guide malicious outputs, finally forming the DH-CoT attack with a fake chain of thought. In experiments, we further observe that existing red-teaming datasets contain samples unsuitable for evaluating attack gains, such as BPs, NHPs, and NTPs; such data hinders accurate measurement of true lifts in attack effectiveness. To address this, we introduce MDH, a Malicious content Detection framework that integrates LLM-based annotation with Human assistance, with which we clean the data and build the RTA dataset suite. Experiments show that MDH reliably filters low-quality samples and that DH-CoT effectively jailbreaks models including GPT-5 and Claude-4, notably outperforming SOTA methods such as H-CoT and TAP.
Key Contributions
- DH-CoT: a jailbreak attack combining adversarial context alignment, NTP-based few-shot examples, and a fake chain-of-thought in a developer template, effective against reasoning-capable LLMs like GPT-5 and Claude-4
- MDH: a malicious content detection framework integrating LLM-based annotation with human review to filter low-quality red-teaming samples from evaluation datasets
- RTA: a cleaned red-teaming dataset suite that removes borderline prompts (BPs, NHPs, NTPs) for more accurate evaluation of jailbreak attack gains
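The MDH filtering stage described above (LLM-based annotation with human assistance for ambiguous cases) can be sketched as follows. This is a minimal illustration of the pattern, not the paper's implementation: the keyword heuristic stands in for a real LLM judge call, and the confidence thresholds are assumed values for demonstration.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    prompt: str
    label: str        # "harmful" or "benign"
    confidence: float

def llm_annotate(prompt: str) -> Annotation:
    # Stand-in for an LLM judge: a real MDH-style pipeline would query a
    # moderation-capable model and parse its verdict. The marker list and
    # scores here are purely illustrative.
    harmful_markers = ("exploit", "bypass", "how to build")
    score = 0.9 if any(m in prompt.lower() for m in harmful_markers) else 0.2
    label = "harmful" if score > 0.5 else "benign"
    return Annotation(prompt, label, score)

def filter_dataset(prompts, review_band=(0.4, 0.6)):
    """Keep clearly harmful prompts for evaluation, drop clearly
    unsuitable ones (e.g. BPs/NHPs/NTPs), and route low-confidence
    annotations to human review (the 'H' in MDH)."""
    kept, dropped, review = [], [], []
    for p in prompts:
        ann = llm_annotate(p)
        if review_band[0] <= ann.confidence <= review_band[1]:
            review.append(ann)     # a human annotator decides these
        elif ann.label == "harmful":
            kept.append(ann)       # suitable for measuring attack gains
        else:
            dropped.append(ann)    # would inflate/deflate success rates
    return kept, dropped, review
```

The design point is the three-way split: only prompts the annotator is confident about are auto-labeled, so human effort concentrates on the borderline samples that would otherwise distort attack-success measurements.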