Latest papers

1 paper
defense · arXiv · Sep 29, 2025

AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models

Zihao Zhu, Xinyu Wu, Gehan Hu et al. · The Chinese University of Hong Kong · State University of New York at Buffalo +1 more

Adversarial CoT fine-tuning teaches reasoning models to self-correct harmful drifts, improving jailbreak robustness while reducing over-refusal

Prompt Injection nlp
2 citations