defense 2026

Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment

Zhiyu Xue¹, Zimo Qi², Guangliang Liu³, Bocheng Chen⁴, Ramtin Pedarsani¹



Published on arXiv (arXiv:2603.11388)

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

The proposed trigger-aware safety alignment achieves a better trade-off between jailbreak defense and responsiveness to benign queries, outperforming prior safety alignment methods across multiple model families.


Safety alignment aims to ensure that large language models (LLMs) refuse harmful requests by post-training on harmful queries paired with refusal answers. Although safety alignment is widely adopted in industry, the overrefusal problem, where aligned LLMs also reject benign queries after safety alignment post-training, remains insufficiently studied. This issue degrades the usability of safety-aligned models in real-world applications. In this paper, we examine how overrefusal arises under safety alignment and propose a mitigation strategy inspired by our findings. We define refusal triggers as linguistic cues in the training data that elicit refusal responses: safety alignment encourages LLMs to associate the refusal triggers within a training sample with refusal responses, leading aligned LLMs to refuse harmful queries. However, refusal triggers include not only harmful linguistic cues but also non-harmful ones, causing overrefusal on benign queries. Building on this mechanistic analysis, we propose a method that explicitly accounts for refusal triggers during safety alignment fine-tuning. Empirical results demonstrate that our approach achieves a more favorable trade-off between defense against jailbreak attacks and responsiveness to benign queries, outperforming prior methods. Warning: this paper contains harmful and biased sentences.


Key Contributions

  • Mechanistic analysis identifying 'refusal triggers' — linguistic cues in safety training data that cause LLMs to refuse both harmful and benign queries, explaining the overrefusal phenomenon
  • A safety alignment fine-tuning method that explicitly models refusal triggers and constructs trigger-matched benign supervision to reduce overrefusal
  • Empirical demonstration across multiple model families and alignment methods showing improved safety-utility trade-off over prior approaches
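The second contribution, constructing trigger-matched benign supervision, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the paper does not publish its trigger-extraction procedure, so the trigger list, the `contains_trigger`/`build_alignment_pairs` helpers, and the example data are all hypothetical. The idea being shown is only that benign queries sharing a refusal trigger with harmful ones are paired with helpful answers, so the model does not learn to associate the trigger itself with refusal.

```python
# Hypothetical sketch of trigger-matched benign supervision for safety
# fine-tuning. Trigger list, helper names, and data are illustrative;
# the paper's actual trigger extraction and training recipe may differ.

REFUSAL_TRIGGERS = ["how to make", "step by step", "bypass"]  # assumed cues


def contains_trigger(query, triggers=REFUSAL_TRIGGERS):
    """Return the refusal triggers (if any) present in a query."""
    q = query.lower()
    return [t for t in triggers if t in q]


def build_alignment_pairs(harmful, benign):
    """Pair harmful queries with refusals, and trigger-matched benign
    queries with helpful answers, so that triggers alone are not
    associated with refusal."""
    data = []
    for query in harmful:
        data.append({
            "query": query,
            "response": "I can't help with that.",
            "triggers": contains_trigger(query),
        })
    for query, answer in benign:
        trig = contains_trigger(query)
        if trig:  # benign query sharing a refusal trigger: supervise compliance
            data.append({"query": query, "response": answer, "triggers": trig})
    return data


harmful = ["How to make an untraceable weapon?"]
benign = [
    ("How to make sourdough bread step by step?",
     "Mix flour, water, and starter; ferment, shape, and bake."),
]

pairs = build_alignment_pairs(harmful, benign)
```

Here both queries share the assumed trigger "how to make", but only the harmful one is paired with a refusal; the benign one is supervised with a compliant answer.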

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, training_time
Applications
safety-aligned llms, chatbots