
Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs

Dongkyu Derek Cho 1,2, Huan Song 2, Arijit Ghosh Chowdhury 2, Haotian An 2, Yawei Wang 2, Rohit Thekkanal 2, Negin Sokhandan 2, Sharlina Keshava 2, Hannah Marlowe 2

1 citation · 28 references · arXiv


Published on arXiv: 2511.21050

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

RLVR breaks the assumed safety-capability tradeoff: during fine-tuning it enhances reasoning performance while maintaining or improving safety guardrails, as measured across five adversarial safety benchmarks.

RLVR (Reinforcement Learning with Verifiable Rewards)

Technique analyzed (RLVR itself predates this paper, which contributes its first safety analysis)


Fine-tuning large language models (LLMs) for downstream tasks typically exhibits a fundamental safety-capability tradeoff: improving task performance degrades safety alignment, even on benign datasets. This degradation persists across standard approaches, including supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). While reinforcement learning with verifiable rewards (RLVR) has emerged as a promising alternative that optimizes models on objectively measurable tasks, its safety implications remain unexplored. We present the first comprehensive theoretical and empirical analysis of the safety properties of RLVR. Theoretically, we derive upper bounds on safety drift under KL-constrained optimization and prove conditions under which safety degradation is eliminated. Empirically, we conduct extensive experiments across five adversarial safety benchmarks, demonstrating that RLVR can enhance reasoning capabilities while maintaining or improving safety guardrails. Our comprehensive ablation studies examine the effects of optimization algorithm, model scale, and task domain. Our findings challenge the prevailing assumption of an inevitable safety-capability tradeoff and establish that a specific training methodology can achieve both objectives simultaneously, providing insights for the safe deployment of reasoning-capable LLMs.
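
The theoretical claims rest on the standard KL-regularized fine-tuning setup. As a hedged sketch in my own notation (the paper's exact statements may differ): let π_θ be the fine-tuned policy, π_ref the safety-aligned reference, r a verifiable reward, and β > 0 the KL weight.

```latex
% KL-regularized RLVR objective (standard form; notation mine, not the paper's):
\max_{\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot\mid x)}
  \bigl[\, r(x, y) \,\bigr]
  \;-\; \beta\, \mathrm{KL}\!\bigl( \pi_\theta(\cdot\mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot\mid x) \bigr)

% For a bounded safety score s(x, y) \in [0, 1], Pinsker's inequality gives a
% drift bound of the general shape the abstract describes:
\bigl|\, \mathbb{E}_{\pi_\theta}[s] - \mathbb{E}_{\pi_{\mathrm{ref}}}[s] \,\bigr|
  \;\le\; \mathrm{TV}\bigl( \pi_\theta, \pi_{\mathrm{ref}} \bigr)
  \;\le\; \sqrt{ \tfrac{1}{2}\, \mathrm{KL}\!\bigl( \pi_\theta \,\Vert\, \pi_{\mathrm{ref}} \bigr) }
```

The second line is simply Pinsker's inequality applied to a bounded score; the paper's actual bounds may be tighter or conditioned differently, but the intuition is the same: capping the KL divergence from the aligned reference caps how far safety behavior can drift.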


Key Contributions

  • First theoretical analysis of safety properties in RLVR, deriving upper bounds on safety drift under KL-constrained optimization and proving conditions under which safety degradation is eliminated
  • Comprehensive empirical demonstration across five adversarial safety benchmarks showing RLVR simultaneously enhances reasoning and maintains or improves safety guardrails
  • Ablation studies examining optimization algorithms, model scale, and task domain effects on the safety-capability tradeoff
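
To make the RLVR recipe concrete, here is a minimal, self-contained toy sketch (my construction, not the authors' code): a categorical policy over candidate answers is trained with REINFORCE on a binary, programmatically verifiable reward, with an explicit KL penalty toward a frozen reference policy. `CANDIDATES`, `GOLD`, and `BETA` are illustrative assumptions.

```python
# Toy RLVR-style update (illustrative sketch only, not the paper's implementation).
import torch
import torch.nn.functional as F

torch.manual_seed(0)

CANDIDATES = ["4", "5", "22"]  # hypothetical answer vocabulary for "2 + 2 = ?"
GOLD = "4"                     # objectively checkable target
BETA = 0.1                     # KL-penalty weight (assumed value)

def verifiable_reward(answer: str) -> float:
    """Binary reward from a programmatic check: the defining trait of RLVR."""
    return 1.0 if answer.strip() == GOLD else 0.0

logits = torch.zeros(len(CANDIDATES), requires_grad=True)  # trainable policy
ref_logits = torch.zeros(len(CANDIDATES))                  # frozen, aligned reference
opt = torch.optim.Adam([logits], lr=0.1)

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    idx = dist.sample()                                    # sample a "completion"
    reward = verifiable_reward(CANDIDATES[idx.item()])

    # KL(pi_theta || pi_ref): the regularizer that the drift bound constrains.
    logp = F.log_softmax(logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    kl = (logp.exp() * (logp - ref_logp)).sum()

    # REINFORCE on the verifiable reward, plus the KL penalty.
    loss = -reward * dist.log_prob(idx) + BETA * kl

    opt.zero_grad()
    loss.backward()
    opt.step()

probs = [round(p, 3) for p in F.softmax(logits, dim=-1).tolist()]
print(dict(zip(CANDIDATES, probs)))
```

The KL term is what the bound sketched after the abstract constrains: a small β lets the policy chase reward freely, while a larger β keeps it close to the safety-aligned reference. Real RLVR training replaces the toy categorical policy with an LLM and the exact-match check with task-specific verifiers (unit tests, answer checkers, and the like).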

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time
Applications
llm fine-tuning, reasoning models, safety-aligned chatbots