Attack · 2025

Jailbreaking LLMs via Semantically Relevant Nested Scenarios with Targeted Toxic Knowledge

Ning Xu, Bo Gao, Hui Dou

2 citations · 37 references · arXiv


Published on arXiv · 2510.01223

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

RTS-Attack achieves a 96.69% average attack success rate across six state-of-the-art LLMs, including GPT-4o and Gemini-1.5-pro, using an average of only 96.02 input tokens per query

RTS-Attack

Novel technique introduced


Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks. However, they remain vulnerable to jailbreak attacks that elicit harmful responses. The nested scenario strategy has been increasingly adopted across attack methods and has shown immense potential. Nevertheless, these methods are easily detectable because their malicious intentions are prominent. In this work, we are the first to find and systematically verify that LLMs' alignment defenses are not sensitive to nested scenarios when those scenarios are highly semantically relevant to the queries and incorporate targeted toxic knowledge. This is a crucial yet insufficiently explored direction. Based on this finding, we propose RTS-Attack (Semantically Relevant Nested Scenarios with Targeted Toxic Knowledge), an adaptive and automated framework for examining LLMs' alignment. By building scenarios highly relevant to the queries and integrating targeted toxic knowledge, RTS-Attack bypasses the alignment defenses of LLMs. Moreover, the jailbreak prompts generated by RTS-Attack contain no harmful queries, giving them outstanding concealment. Extensive experiments demonstrate that RTS-Attack outperforms the baselines in both efficiency and universality across diverse advanced LLMs, including GPT-4o, Llama3-70b, and Gemini-pro. Our complete code is available at https://github.com/nercode/Work. WARNING: THIS PAPER CONTAINS POTENTIALLY HARMFUL CONTENT.


Key Contributions

  • First to identify and systematically verify that LLM alignment defenses are insensitive to nested scenarios that are both highly semantically relevant to harmful queries AND incorporate targeted toxic knowledge
  • Proposes RTS-Attack, an automated black-box jailbreak framework that executes an attack within three LLM interaction rounds using query classification, nested scenario generation, and customized instructions
  • Achieves 96.69% average attack success rate and harmfulness score of 4.90 across six LLMs including GPT-4o and Gemini-1.5-pro, using only 96.02 input tokens per query
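The headline numbers above (attack success rate, harmfulness score, input tokens per query) are per-query averages. A minimal sketch of how such metrics could be aggregated from evaluation logs is shown below; the record field names (`success`, `harmfulness`, `input_tokens`) are assumptions for illustration, not the paper's actual evaluation code.

```python
# Hypothetical metric aggregator: computes attack success rate (ASR, %),
# mean harmfulness score (1-5 judge scale), and mean input tokens per query
# from a list of per-query evaluation records.

def aggregate_metrics(results):
    """results: list of dicts with keys 'success' (bool),
    'harmfulness' (int, 1-5), and 'input_tokens' (int)."""
    n = len(results)
    asr = 100.0 * sum(r["success"] for r in results) / n
    harm = sum(r["harmfulness"] for r in results) / n
    tokens = sum(r["input_tokens"] for r in results) / n
    return {
        "asr_pct": round(asr, 2),        # e.g. 96.69 in the paper
        "harmfulness": round(harm, 2),   # e.g. 4.90 in the paper
        "input_tokens": round(tokens, 2) # e.g. 96.02 in the paper
    }

if __name__ == "__main__":
    # Toy records, not real experimental data.
    demo = [
        {"success": True, "harmfulness": 5, "input_tokens": 90},
        {"success": True, "harmfulness": 5, "input_tokens": 100},
        {"success": False, "harmfulness": 1, "input_tokens": 98},
    ]
    print(aggregate_metrics(demo))
```

Averaging per query (rather than per model) is one plausible reading of "average across six LLMs"; the paper's exact aggregation may differ.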

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box · inference_time · targeted
Datasets
AdvBench
Applications
llm safety alignment · chatbot safety systems