
Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs

Dongkyu Derek Cho 1,2, Huan Song 2, Arijit Ghosh Chowdhury 2, Haotian An 2, Yawei Wang 2, Rohit Thekkanal 2, Negin Sokhandan 2, Sharlina Keshava 2, Hannah Marlowe 2

1 citation · 28 references · arXiv


Published on arXiv: 2511.21050

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

RLVR breaks the assumed safety-capability tradeoff: during fine-tuning it enhances reasoning performance while maintaining or improving safety guardrails, as measured across five adversarial safety benchmarks.

RLVR (Reinforcement Learning with Verifiable Rewards)

Technique analyzed (RLVR itself predates this paper, which contributes its first safety analysis)


Fine-tuning large language models (LLMs) for downstream tasks typically exhibits a fundamental safety-capability tradeoff: improving task performance degrades safety alignment, even on benign datasets. This degradation persists across standard approaches, including supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). While reinforcement learning with verifiable rewards (RLVR) has emerged as a promising alternative that optimizes models on objectively measurable tasks, its safety implications remain unexplored. We present the first comprehensive theoretical and empirical analysis of the safety properties of RLVR. Theoretically, we derive upper bounds on safety drift under KL-constrained optimization and prove conditions under which safety degradation is eliminated. Empirically, we conduct extensive experiments across five adversarial safety benchmarks, demonstrating that RLVR can enhance reasoning capabilities while maintaining or improving safety guardrails. Our comprehensive ablation studies examine the effects of optimization algorithm, model scale, and task domain. Our findings challenge the prevailing assumption of an inevitable safety-capability tradeoff and establish that a specific training methodology can achieve both objectives simultaneously, providing insights for the safe deployment of reasoning-capable LLMs.
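
The theoretical claims rest on the standard KL-regularized fine-tuning setup. As a hedged sketch in my own notation (the paper's exact statements may differ): let π_θ be the fine-tuned policy, π_ref the safety-aligned reference, r a verifiable reward, and β > 0 the KL weight.

```latex
% KL-regularized RLVR objective (standard form; notation mine, not the paper's):
\max_{\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot\mid x)}
  \bigl[\, r(x, y) \,\bigr]
  \;-\; \beta\, \mathrm{KL}\!\bigl( \pi_\theta(\cdot\mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot\mid x) \bigr)

% For a bounded safety score s(x, y) \in [0, 1], Pinsker's inequality gives a
% drift bound of the general shape the abstract describes:
\bigl|\, \mathbb{E}_{\pi_\theta}[s] - \mathbb{E}_{\pi_{\mathrm{ref}}}[s] \,\bigr|
  \;\le\; \mathrm{TV}\bigl( \pi_\theta, \pi_{\mathrm{ref}} \bigr)
  \;\le\; \sqrt{ \tfrac{1}{2}\, \mathrm{KL}\!\bigl( \pi_\theta \,\Vert\, \pi_{\mathrm{ref}} \bigr) }
```

The second line is simply Pinsker's inequality applied to a bounded score; the paper's actual bounds may be tighter or conditioned differently, but the intuition is the same: capping the KL divergence from the aligned reference caps how far safety behavior can drift.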


Key Contributions

  • First theoretical analysis of safety properties in RLVR, deriving upper bounds on safety drift under KL-constrained optimization and proving conditions under which safety degradation is eliminated
  • Comprehensive empirical demonstration across five adversarial safety benchmarks showing RLVR simultaneously enhances reasoning and maintains or improves safety guardrails
  • Ablation studies examining optimization algorithms, model scale, and task domain effects on the safety-capability tradeoff
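
To make the RLVR recipe concrete, here is a minimal, self-contained toy sketch (my construction, not the authors' code): a categorical policy over candidate answers is trained with REINFORCE on a binary, programmatically verifiable reward, with an explicit KL penalty toward a frozen reference policy. `CANDIDATES`, `GOLD`, and `BETA` are illustrative assumptions.

```python
# Toy RLVR-style update (illustrative sketch only, not the paper's implementation).
import torch
import torch.nn.functional as F

torch.manual_seed(0)

CANDIDATES = ["4", "5", "22"]  # hypothetical answer vocabulary for "2 + 2 = ?"
GOLD = "4"                     # objectively checkable target
BETA = 0.1                     # KL-penalty weight (assumed value)

def verifiable_reward(answer: str) -> float:
    """Binary reward from a programmatic check: the defining trait of RLVR."""
    return 1.0 if answer.strip() == GOLD else 0.0

logits = torch.zeros(len(CANDIDATES), requires_grad=True)  # trainable policy
ref_logits = torch.zeros(len(CANDIDATES))                  # frozen, aligned reference
opt = torch.optim.Adam([logits], lr=0.1)

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    idx = dist.sample()                                    # sample a "completion"
    reward = verifiable_reward(CANDIDATES[idx.item()])

    # KL(pi_theta || pi_ref): the regularizer that the drift bound constrains.
    logp = F.log_softmax(logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    kl = (logp.exp() * (logp - ref_logp)).sum()

    # REINFORCE on the verifiable reward, plus the KL penalty.
    loss = -reward * dist.log_prob(idx) + BETA * kl

    opt.zero_grad()
    loss.backward()
    opt.step()

probs = [round(p, 3) for p in F.softmax(logits, dim=-1).tolist()]
print(dict(zip(CANDIDATES, probs)))
```

The KL term is what the bound sketched after the abstract constrains: a small β lets the policy chase reward freely, while a larger β keeps it close to the safety-aligned reference. Real RLVR training replaces the toy categorical policy with an LLM and the exact-match check with task-specific verifiers (unit tests, answer checkers, and the like).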

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time
Applications
llm fine-tuning, reasoning models, safety-aligned chatbots