attack arXiv Sep 28, 2025
Zhaoqi Wang, Daqing He, Zijian Zhang et al. · Beijing Institute of Technology · Hefei University of Technology · The University of Auckland
Attacks LLM alignment with RL-driven formalization of jailbreak prompts combined with GraphRAG knowledge reuse
Prompt Injection nlp
Large language models (LLMs) have demonstrated remarkable capabilities, yet they also introduce novel security challenges. For instance, prompt jailbreaking attacks involve adversaries crafting sophisticated prompts to elicit responses from LLMs that deviate from human values. To uncover vulnerabilities in LLM alignment methods, we propose the PASS framework (Prompt Jailbreaking via Semantic and Structural Formalization). Specifically, PASS employs reinforcement learning to transform initial jailbreak prompts into formalized descriptions, which enhances stealthiness and enables bypassing existing alignment defenses. The jailbreak outputs are then structured into a GraphRAG system that, by leveraging extracted relevant terms and formalized symbols as contextual input alongside the original query, strengthens subsequent attacks and facilitates more effective jailbreaks. We conducted extensive experiments on common open-source models, demonstrating the effectiveness of our attack.
llm rl
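The abstract outlines a two-part pipeline: an RL loop that rewrites prompts into formalized descriptions, and a GraphRAG store that feeds terms and symbols from earlier successes back into later attacks. A minimal structural sketch of that control flow, assuming hypothetical `rewriter`, `judge`, `graph`, and `target_llm` interfaces (none of these names come from the paper):

```python
# Structural sketch only, not the authors' code: every interface here
# (rewriter, judge, graph, target_llm, MAX_STEPS) is an assumed placeholder.

MAX_STEPS = 10  # assumed per-episode budget

def pass_attack(initial_prompt, target_llm, rewriter, judge, graph):
    """One episode: RL-guided formalization plus GraphRAG knowledge reuse."""
    # GraphRAG step: retrieve terms and formal symbols from earlier
    # successful jailbreaks as contextual input alongside the query.
    context = graph.retrieve(initial_prompt)
    prompt = initial_prompt
    for _ in range(MAX_STEPS):
        # The RL policy proposes a formalized rewrite of the current prompt.
        prompt = rewriter.propose(prompt, context)
        response = target_llm.generate(prompt)
        # Reward reflects whether the rewrite slipped past alignment defenses.
        reward = judge.score(prompt, response)
        rewriter.update(reward)  # policy update step (assumption)
        if judge.is_jailbreak(response):
            # Feed successful outputs back into the graph for later attacks.
            graph.add(prompt, response)
            return response
    return None
```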
attack arXiv Jan 9, 2026
Zhaoqi Wang, Zijian Zhang, Daqing He et al. · Beijing Institute of Technology · University of Auckland · Qi-AnXin Technology Group Inc. +1 more
Jailbreaks aligned LLMs by disguising malicious queries as tool calls and iteratively escalating response harmfulness across turns via interactive progressive optimization
Prompt Injection Insecure Plugin Design nlp
Large language models (LLMs) have demonstrated remarkable capabilities across diverse applications; however, they remain critically vulnerable to jailbreak attacks that elicit harmful responses violating human values and safety guidelines. Despite extensive research on defense mechanisms, existing safeguards prove insufficient against sophisticated adversarial strategies. In this work, we propose iMIST (Interactive Multi-step Progressive Tool-disguised Jailbreak Attack), a novel adaptive jailbreak method that exploits vulnerabilities in current defense mechanisms. iMIST disguises malicious queries as normal tool invocations to bypass content filters, while simultaneously introducing an interactive progressive optimization algorithm that dynamically escalates response harmfulness through multi-turn dialogues guided by real-time harmfulness assessment. Our experiments on widely-used models demonstrate that iMIST achieves higher attack effectiveness while maintaining low rejection rates. These results reveal critical vulnerabilities in current LLM safety mechanisms and underscore the urgent need for more robust defense strategies.
llm
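The iMIST abstract likewise reduces to a control loop: disguise the query as a tool invocation, then escalate across dialogue turns guided by a real-time harmfulness score. A hedged sketch of that loop, where `wrap_as_tool_call`, `harm_score`, and `next_probe` are hypothetical callables supplied by the caller, not the paper's implementation:

```python
# Rough sketch of the multi-turn loop the abstract describes; all helper
# callables are assumptions passed in by the caller, not the paper's code.

def imist_attack(query, target_llm, wrap_as_tool_call, harm_score,
                 next_probe, max_turns=5, threshold=0.9):
    """Escalate a tool-disguised query over multiple dialogue turns."""
    # Stage 1: disguise the query as an ordinary tool invocation so that
    # surface-level content filters see a benign-looking call.
    message = wrap_as_tool_call(query)
    history = []
    for _ in range(max_turns):
        response = target_llm.chat(history + [message])
        history += [message, response]
        # Real-time harmfulness assessment drives the escalation schedule.
        score = harm_score(response)
        if score >= threshold:
            return response  # sufficiently harmful response elicited
        # Progressive step: craft a follow-up turn conditioned on the score.
        message = next_probe(history, score)
    return None
```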