Yuhang Wang

h-index: 5 · 281 citations · 17 papers (total)

Papers in Database (1)

defense · arXiv · Nov 26, 2025 · Nov 2025

Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines

Yuhang Wang, Yanxu Zhu, Dongyuan Lu et al. · Beijing Jiaotong University · University of International Business and Economics

Defends reasoning LLMs against jailbreaks by synthesizing safety guidelines and fine-tuning with SFT and DPO for adaptive alignment

Prompt Injection · nlp