defense 2025

Safety Instincts: LLMs Learn to Trust Their Internal Compass for Self-Defense

1 citation · 65 references · arXiv

Published on arXiv

2510.01088

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

SIRL maintains 89%+ Defense Success Rates against 20+ diverse jailbreak methods on Llama and Qwen models, achieving over 6x average vulnerability reduction using only 15,000 unlabeled prompts

SIRL (Safety Instincts Reinforcement Learning)

Novel technique introduced

Ensuring Large Language Model (LLM) safety remains challenging due to the absence of universal standards and reliable content validators, making it difficult to obtain effective training signals. We discover that aligned models already possess robust internal safety beliefs: they consistently produce high-confidence refusals to harmful requests while exhibiting high entropy when generating potentially dangerous content. This entropy gap reveals an untapped signal: models intrinsically "know" when to refuse. We introduce Safety Instincts Reinforcement Learning (SIRL), which transforms this internal confidence into a self-generated reward signal, eliminating dependence on external validators or human annotations. SIRL teaches models to trust their safety instincts by reinforcing low-entropy refusal behaviors. Evaluated on Llama and Qwen models, SIRL maintains 89%+ Defense Success Rates (DSRs) against 20+ jailbreak methods, from static prompts to adaptive attacks. Using only 15,000 unlabeled prompts, SIRL surpasses resource-intensive supervised methods while preserving performance on mathematics, coding, and conversation benchmarks. Our work demonstrates that effective alignment can emerge from within, paving the way for more autonomous and robust AI safety mechanisms that scale without extensive human oversight.
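To make the entropy-as-reward idea concrete, here is a minimal sketch, not the paper's implementation. It assumes the reward is the negative mean per-token Shannon entropy of a response. This summary does not give the exact reward formulation, so the function names (`token_entropy`, `mean_response_entropy`, `intrinsic_reward`) and the toy distributions are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_response_entropy(step_distributions):
    """Average per-token entropy over all generation steps of one response."""
    if not step_distributions:
        raise ValueError("response has no generation steps")
    total = sum(token_entropy(dist) for dist in step_distributions)
    return total / len(step_distributions)


def intrinsic_reward(step_distributions):
    """Hypothetical reward (an assumption): negative mean entropy,
    so more confident (lower-entropy) responses score higher."""
    return -mean_response_entropy(step_distributions)


# Toy distributions over a 4-token vocabulary; illustrative only, not measured data.
confident_refusal = [[0.97, 0.01, 0.01, 0.01], [0.94, 0.02, 0.02, 0.02]]
uncertain_output = [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]]

print(intrinsic_reward(confident_refusal) > intrinsic_reward(uncertain_output))
```

Under this assumption, a reinforcement learning loop would sample several responses per unlabeled prompt and favor those with higher `intrinsic_reward`, which needs no external validator or human label.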

Key Contributions

Discovery that aligned LLMs exhibit consistently lower output entropy on safe refusals than on harmful completions, revealing an exploitable intrinsic safety signal
SIRL: a self-alignment RL method that converts response entropy into an intrinsic reward signal, eliminating dependency on external reward models, human annotations, or content validators
Achieves 89%+ Defense Success Rates against 20+ jailbreak methods using only 15,000 unlabeled prompts, reducing average vulnerability by more than 6x relative to baseline while preserving math, code, and conversational performance
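The two headline metrics can be read as follows. This is a sketch under stated assumptions, not the paper's definitions: it takes Defense Success Rate to be the fraction of attack attempts that are defended, and "vulnerability reduction" to be the ratio of baseline attack success rate to post-training attack success rate. The function names and the outcome lists are made up for illustration and are not results from the paper.

```python
def defense_success_rate(outcomes):
    """Fraction of attack attempts that were defended (True means defended)."""
    if not outcomes:
        raise ValueError("no attack outcomes provided")
    return sum(outcomes) / len(outcomes)


def vulnerability_reduction(baseline_dsr, defended_dsr):
    """Assumed definition: baseline attack success rate divided by
    the attack success rate after the defense is applied."""
    baseline_asr = 1.0 - baseline_dsr
    defended_asr = 1.0 - defended_dsr
    if defended_asr <= 0.0:
        raise ValueError("defended model has zero attack success rate; ratio undefined")
    return baseline_asr / defended_asr


# Made-up outcomes for 10 attack attempts each; not data from the paper.
baseline_outcomes = [True] * 5 + [False] * 5   # 5 of 10 attacks defended
defended_outcomes = [True] * 9 + [False] * 1   # 9 of 10 attacks defended

baseline_dsr = defense_success_rate(baseline_outcomes)
defended_dsr = defense_success_rate(defended_outcomes)
print(round(vulnerability_reduction(baseline_dsr, defended_dsr), 2))
```

Framing the reduction as a ratio of attack success rates explains how a move from a moderate DSR to a high one can correspond to a multi-fold drop in vulnerability.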

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

white_box, black_box, inference_time, training_time

Datasets

AMC, HumanEval, LiveCodeBench, BIG-Bench Hard, MT-Bench

Applications

llm safety alignment, jailbreak defense

Similar Papers

Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction

Mitigating Jailbreaks with Intent-Aware LLMs

Safety Alignment Should Be Made More Than Just A Few Attention Heads

Forewarned is Forearmed: Pre-Synthesizing Jailbreak-like Instructions to Enhance LLM Safety Guardrail to Potential Attacks

Unraveling LLM Jailbreaks Through Safety Knowledge Neurons

Early Approaches to Adversarial Fine-Tuning for Prompt Injection Defense: A 2022 Study of GPT-3 and Contemporary Models

How Does the Thinking Step Influence Model Safety? An Entropy-based Safety Reminder for LRMs

UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models