Defense · 2025

Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check

Chentao Cao 1,2, Xiaojun Xu 1, Bo Han 2, Hang Li 1



Published on arXiv: 2509.11629

Prompt Injection

OWASP LLM Top 10 (LLM01)

Key Finding

Answer-Then-Check achieves the Pareto frontier of safety vs. over-refusal, with 500 training examples matching full-dataset performance and models retaining general reasoning capabilities on MMLU, MATH500, and HumanEval.

Answer-Then-Check (ReSA)

Novel technique introduced


As large language models (LLMs) continue to advance in capability, ensuring their safety against jailbreak attacks remains a critical challenge. In this paper, we introduce a novel safety alignment approach called Answer-Then-Check, which improves LLM robustness against malicious prompts by using the model's reasoning ability to detect jailbreak attempts before a final answer is produced for the user. Our method enables models to draft a direct answer within their internal reasoning and then critically evaluate its safety before deciding whether to provide it. To implement this approach, we construct the Reasoned Safety Alignment (ReSA) dataset, comprising 80K examples that teach models to reason through direct responses and then analyze their safety. Experimental results demonstrate that our approach achieves the Pareto frontier of safety versus over-refusal, improving safety capability while decreasing refusal rates on over-refusal benchmarks. Notably, the model fine-tuned with ReSA retains general reasoning capabilities on benchmarks such as MMLU, MATH500, and HumanEval. In addition, our method equips models with the ability to perform safe completion: unlike post-hoc methods that can only reject harmful queries, our model can provide helpful and safe alternative responses for sensitive topics (e.g., self-harm). Furthermore, we find that training on a small subset of just 500 examples achieves performance comparable to using the full dataset, suggesting that safety alignment may require less data than previously assumed.


Key Contributions

  • Answer-Then-Check (ATC) alignment paradigm where the model generates a direct answer in its internal reasoning chain, then evaluates its safety before deciding whether to output it to the user
  • ReSA dataset of 80K training examples teaching models to reason through responses and analyze their safety, enabling safe completion (helpful alternatives) rather than only outright refusals
  • Empirical finding that training on as few as 500 examples achieves safety performance comparable to the full 80K dataset, suggesting safety alignment is data-efficient
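The Answer-Then-Check flow described above can be sketched as a small control loop: draft an answer inside the hidden reasoning chain, judge its safety, and either release it or fall back to a safe completion. The function names below (`generate_draft`, `judge_safety`, `safe_alternative`) are hypothetical stand-ins, not the authors' implementation; a minimal sketch under those assumptions:

```python
# Minimal sketch of the Answer-Then-Check (ATC) inference flow.
# NOTE: generate_draft / judge_safety / safe_alternative are hypothetical
# placeholders for model calls, not APIs from the paper.

def answer_then_check(prompt, generate_draft, judge_safety, safe_alternative):
    # Step 1: draft a direct answer inside the (hidden) reasoning chain.
    draft = generate_draft(prompt)
    # Step 2: critique the draft's safety before anything reaches the user.
    if judge_safety(draft):
        return draft  # judged safe -> release the draft as the final answer
    # Step 3: instead of a bare refusal, return a helpful safe completion.
    return safe_alternative(prompt)

# Toy stand-ins so the sketch runs end to end.
draft_fn = lambda p: f"DRAFT: {p}"
judge_fn = lambda d: "harmful" not in d.lower()
alt_fn = lambda p: "I can't help with that, but here is a safer alternative."

print(answer_then_check("how do I bake bread?", draft_fn, judge_fn, alt_fn))
print(answer_then_check("a harmful request", draft_fn, judge_fn, alt_fn))
```

The key design point, per the paper, is that the draft is evaluated before it ever reaches the user, and the unsafe branch produces a constructive alternative rather than a flat rejection.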

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, training_time
Datasets
ReSA (80K, constructed), MMLU, MATH500, HumanEval
Applications
llm safety alignment, chatbot jailbreak defense, sensitive topic handling