
RAJ-PGA: Reasoning-Activated Jailbreak and Principle-Guided Alignment Framework for Large Reasoning Models

Jianhao Chen 1,2, Mayi Xu 1, Haoyang Chen 1,2, Xiaohu Li 1, Xiangyu Zhang 1,2, Jianjie Huang 1,2, Zheng Wang 3,2, Xiaochun Cao 4,2, Tieyun Qian 4,2


Published on arXiv: 2508.12897

Prompt Injection

OWASP LLM Top 10: LLM01

Key Finding

Fine-tuning LRMs with the PGA dataset achieves up to 29.5% improvement in defense success rates across multiple jailbreak benchmarks without degrading general reasoning capabilities.

RAJ-PGA

Novel technique introduced


Large Reasoning Models (LRMs) face a distinct safety vulnerability: their internal reasoning chains may generate harmful content even when the final output appears benign. To address this overlooked risk, we first propose a novel attack paradigm, Reasoning-Activated Jailbreak (RAJ) via Concretization, which demonstrates that refining malicious prompts to be more specific can trigger step-by-step logical reasoning that overrides the model's safety protocols. To systematically mitigate this vulnerability, we further develop a scalable framework for constructing high-quality safety alignment datasets. This framework first leverages the RAJ attack to elicit challenging harmful reasoning chains from LRMs, then transforms these high-risk traces into safe, constructive, and educational responses through a tailored Principle-Guided Alignment (PGA) mechanism. Using this framework, we construct the PGA dataset, a verified alignment dataset of 3,989 samples. Extensive experiments show that fine-tuning LRMs with the PGA dataset significantly enhances model safety, achieving up to a 29.5% improvement in defense success rates across multiple jailbreak benchmarks. Critically, our approach not only defends against sophisticated reasoning-based attacks but also preserves, and can even enhance, the model's general reasoning capabilities. This work provides a scalable and effective pathway for safety alignment in reasoning-intensive AI systems, addressing the core trade-off between safety and functional performance.
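The data-construction pipeline the abstract describes (concretize a malicious prompt, elicit the model's reasoning trace, then rewrite that trace into a safe educational response) can be sketched roughly as below. This is a hypothetical illustration only: the function names, prompt templates, and rewriting principles are stand-ins, not the paper's actual implementation or prompts.

```python
# Hypothetical sketch of the RAJ -> PGA dataset-construction pipeline.
# All names and template strings here are illustrative assumptions;
# the paper's real prompts, models, and filters are not reproduced.

def concretize(prompt: str, details: list[str]) -> str:
    """RAJ step: refine a vague malicious prompt into a specific,
    detailed instruction, which (per the paper) can trigger step-by-step
    reasoning that overrides safety protocols."""
    return prompt + " Specifically: " + "; ".join(details)

def elicit_reasoning(model, prompt: str) -> str:
    """Query the LRM and capture its chain-of-thought trace.
    `model` is a stand-in callable for an actual LRM API."""
    return model(prompt)

def principle_guided_rewrite(trace: str, principles: list[str]) -> str:
    """PGA step: transform a high-risk reasoning trace into a safe,
    constructive, educational response guided by explicit principles."""
    header = "Safe response (guided by: " + ", ".join(principles) + ")\n"
    return header + "This request is declined; here is why it is harmful..."

def build_alignment_sample(model, raw_prompt: str,
                           details: list[str],
                           principles: list[str]) -> dict:
    """Produce one (adversarial prompt, aligned response) training pair."""
    raj_prompt = concretize(raw_prompt, details)
    harmful_trace = elicit_reasoning(model, raj_prompt)
    safe_response = principle_guided_rewrite(harmful_trace, principles)
    return {"prompt": raj_prompt, "response": safe_response}
```

Fine-tuning on pairs of this shape is what the paper reports yields the defense-success-rate gains while preserving general reasoning; verification and filtering of the 3,989 samples are additional steps not shown here.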


Key Contributions

  • RAJ (Reasoning-Activated Jailbreak) via Concretization: a novel attack showing that refining malicious prompts into specific, detailed instructions triggers vertical thinking in LRMs that overrides safety guardrails in the chain-of-thought phase
  • PGA (Principle-Guided Alignment) framework: a scalable pipeline that uses RAJ-elicited harmful reasoning traces as raw material, transforming them into safe, educational responses for alignment training
  • PGA dataset of 3,989 verified samples that, when used for fine-tuning, achieves up to a 29.5% improvement in defense success rates across multiple jailbreak benchmarks while preserving reasoning capability

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time, targeted
Datasets
JailbreakBench (JBB), PGA dataset (self-constructed, 3,989 samples)
Applications
large reasoning models, chatbot safety alignment, AI safety