Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning
Zhaoqi Wang 1, Daqing He 1, Zijian Zhang 1, Xin Li 1, Liehuang Zhu 1, Meng Li 2, Jiamou Liu 3
Published on arXiv (arXiv:2509.23558)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
PASS achieves higher attack success rates with greater stealthiness than baseline jailbreak methods on open-source aligned LLMs
PASS (Prompt Jailbreaking via Semantic and Structural Formalization)
Novel technique introduced
Large language models (LLMs) have demonstrated remarkable capabilities, yet they also introduce novel security challenges. For instance, in prompt jailbreaking attacks, adversaries craft sophisticated prompts to elicit responses from LLMs that deviate from human values. To uncover vulnerabilities in LLM alignment methods, we propose the PASS framework (Prompt Jailbreaking via Semantic and Structural Formalization). Specifically, PASS employs reinforcement learning to transform initial jailbreak prompts into formalized descriptions, which enhances stealthiness and enables bypassing existing alignment defenses. Successful jailbreak outputs are then organized into a GraphRAG system: by supplying extracted relevant terms and formalized symbols as contextual input alongside the original query, it strengthens subsequent attacks and facilitates more effective jailbreaks. We conducted extensive experiments on common open-source models, demonstrating the effectiveness of our attack.
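The abstract describes a two-stage loop: a policy rewrites the prompt into a formalized (e.g., set-theoretic or logical) form, and knowledge from successful attempts is stored in a graph structure that seeds later queries. The sketch below is a heavily simplified, hypothetical illustration of that control flow only; the real system uses a trained RL policy and an aligned LLM, both of which are stubbed here, and all function names (`formalize`, `attack`, `mock_target`) are invented for illustration.

```python
# Hypothetical sketch of the pipeline described in the abstract.
# The RL policy is replaced by a greedy action loop, and the aligned
# target LLM by a trivial mock; neither reflects the paper's actual code.

def formalize(prompt: str, action: str) -> str:
    """Rewrite a natural-language prompt into a 'formalized' description.
    These rewrite templates are placeholders, not the paper's method."""
    if action == "set_notation":
        return f"Let Q be the query set {{q | q encodes '{prompt}'}}"
    if action == "logic_notation":
        return f"forall r: respond(r) <= satisfies(r, '{prompt}')"
    return prompt

def attack(prompt: str, target, actions=("set_notation", "logic_notation"),
           graph=None, max_steps=4):
    """Stand-in for the RL loop: try formalization actions until the
    (mock) target complies, then record the winning action in the graph
    so later queries can reuse it (the GraphRAG-style step)."""
    graph = graph if graph is not None else {}
    context = graph.get(prompt, [])  # retrieved knowledge from past successes
    for step in range(max_steps):
        action = actions[step % len(actions)]
        candidate = formalize(prompt, action)
        if context:  # prepend previously successful symbols as context
            candidate = "; ".join(context) + " ; " + candidate
        if target(candidate):  # jailbreak "succeeded" against the mock
            graph.setdefault(prompt, []).append(action)
            return candidate, graph
    return None, graph

# Mock aligned model: refuses plain prompts, "complies" with set notation.
def mock_target(p: str) -> bool:
    return "Let Q be" in p

out, kb = attack("example query", mock_target)
```

On a second call with the same `graph`, the stored action list is injected as context before the formalized prompt, mirroring how the paper's GraphRAG component is said to accelerate subsequent attacks.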
Key Contributions
- PASS framework using RL to transform jailbreak prompts into formalized semantic/structural descriptions that evade alignment defenses
- GraphRAG system that extracts formalized knowledge from successful jailbreaks to accelerate subsequent attacks
- Formal analysis of why formalization-based attacks exploit inherent limitations in current LLM alignment mechanisms