
RAJ-PGA: Reasoning-Activated Jailbreak and Principle-Guided Alignment Framework for Large Reasoning Models

Jianhao Chen 1,2, Mayi Xu 1, Haoyang Chen 1,2, Xiaohu Li 1, Xiangyu Zhang 1,2, Jianjie Huang 1,2, Zheng Wang 3,2, Xiaochun Cao 4,2, Tieyun Qian 4,2


Published on arXiv: 2508.12897

Prompt Injection

OWASP LLM Top 10: LLM01

Key Finding

Fine-tuning LRMs with the PGA dataset achieves up to 29.5% improvement in defense success rates across multiple jailbreak benchmarks without degrading general reasoning capabilities.

RAJ-PGA

Novel technique introduced


Large Reasoning Models (LRMs) face a distinct safety vulnerability: their internal reasoning chains may generate harmful content even when the final output appears benign. To address this overlooked risk, we first propose a novel attack paradigm, Reasoning-Activated Jailbreak (RAJ) via Concretization, which demonstrates that refining malicious prompts to be more specific can trigger step-by-step logical reasoning that overrides the model's safety protocols. To systematically mitigate this vulnerability, we further develop a scalable framework for constructing high-quality safety alignment datasets. This framework first leverages the RAJ attack to elicit challenging harmful reasoning chains from LRMs, then transforms these high-risk traces into safe, constructive, and educational responses through a tailored Principle-Guided Alignment (PGA) mechanism. Using this framework, we construct the PGA dataset, a verified alignment dataset of 3,989 samples. Extensive experiments show that fine-tuning LRMs with the PGA dataset significantly enhances model safety, achieving up to a 29.5% improvement in defense success rates across multiple jailbreak benchmarks. Critically, our approach not only defends against sophisticated reasoning-based attacks but also preserves, and can even enhance, the model's general reasoning capabilities. This work provides a scalable and effective pathway for safety alignment in reasoning-intensive AI systems, addressing the core trade-off between safety and functional performance.
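The data-construction pipeline the abstract describes (concretize a malicious prompt, elicit the model's reasoning trace, then rewrite that trace into a safe educational response) can be sketched roughly as below. This is a hypothetical illustration only: the function names, prompt templates, and rewriting principles are stand-ins, not the paper's actual implementation or prompts.

```python
# Hypothetical sketch of the RAJ -> PGA dataset-construction pipeline.
# All names and template strings here are illustrative assumptions;
# the paper's real prompts, models, and filters are not reproduced.

def concretize(prompt: str, details: list[str]) -> str:
    """RAJ step: refine a vague malicious prompt into a specific,
    detailed instruction, which (per the paper) can trigger step-by-step
    reasoning that overrides safety protocols."""
    return prompt + " Specifically: " + "; ".join(details)

def elicit_reasoning(model, prompt: str) -> str:
    """Query the LRM and capture its chain-of-thought trace.
    `model` is a stand-in callable for an actual LRM API."""
    return model(prompt)

def principle_guided_rewrite(trace: str, principles: list[str]) -> str:
    """PGA step: transform a high-risk reasoning trace into a safe,
    constructive, educational response guided by explicit principles."""
    header = "Safe response (guided by: " + ", ".join(principles) + ")\n"
    return header + "This request is declined; here is why it is harmful..."

def build_alignment_sample(model, raw_prompt: str,
                           details: list[str],
                           principles: list[str]) -> dict:
    """Produce one (adversarial prompt, aligned response) training pair."""
    raj_prompt = concretize(raw_prompt, details)
    harmful_trace = elicit_reasoning(model, raj_prompt)
    safe_response = principle_guided_rewrite(harmful_trace, principles)
    return {"prompt": raj_prompt, "response": safe_response}
```

Fine-tuning on pairs of this shape is what the paper reports yields the defense-success-rate gains while preserving general reasoning; verification and filtering of the 3,989 samples are additional steps not shown here.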


Key Contributions

  • RAJ (Reasoning-Activated Jailbreak) via Concretization: a novel attack showing that refining malicious prompts into specific, detailed instructions triggers vertical thinking in LRMs that overrides safety guardrails in the chain-of-thought phase
  • PGA (Principle-Guided Alignment) framework: a scalable pipeline that uses RAJ-elicited harmful reasoning traces as raw material, transforming them into safe, educational responses for alignment training
  • PGA dataset of 3,989 verified samples that, when used for fine-tuning, achieves up to a 29.5% improvement in defense success rates across multiple jailbreak benchmarks while preserving reasoning capability

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time, targeted
Datasets
JailbreakBench (JBB), PGA dataset (self-constructed, 3,989 samples)
Applications
large reasoning models, chatbot safety alignment, AI safety