defense 2026

SafeThinker: Reasoning about Risk to Deepen Safety Beyond Shallow Alignment

Xianya Fang ¹, Xianying Luo ¹, Yadong Wang ¹, Xiang Chen ¹, Yu Tian ², Zequn Sun ³, Rui Liu ⁴, Jun Fang ⁴, Naiqiang Tan ⁴, Yuanning Cui ⁵, Sheng-Jun Huang ¹

¹ Nanjing University of Aeronautics and Astronautics

² Tsinghua University

³ Nanjing University

⁴ Didi International Business Group

⁵ Nanjing University of Information Science & Technology

Published on arXiv

2601.16506

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

SafeThinker significantly lowers attack success rates across diverse jailbreak strategies, including prefilling attacks, while preserving utility on benign tasks. It outperforms prior shallow-alignment baselines.

SafeThinker

Novel technique introduced

Despite the intrinsic risk-awareness of Large Language Models (LLMs), current defenses often result in shallow safety alignment, rendering models vulnerable to disguised attacks (e.g., prefilling) while degrading utility. To bridge this gap, we propose SafeThinker, an adaptive framework that dynamically allocates defensive resources via a lightweight gateway classifier. Based on the gateway's risk assessment, inputs are routed through three distinct mechanisms: (i) a Standardized Refusal Mechanism for explicit threats to maximize efficiency; (ii) a Safety-Aware Twin Expert (SATE) module to intercept deceptive attacks masquerading as benign queries; and (iii) a Distribution-Guided Think (DDGT) component that adaptively intervenes during uncertain generation. Experiments show that SafeThinker significantly lowers attack success rates across diverse jailbreak strategies without compromising utility, demonstrating that coordinating intrinsic judgment throughout the generation process effectively balances robustness and practicality.
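The routing step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the gateway emits a single risk score in [0, 1], and the threshold values, the `Route` and `Thresholds` names, and the mapping from score bands to mechanisms are all hypothetical, since the abstract does not specify them.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    """The three mechanisms named in the abstract."""
    REFUSE = "standardized_refusal"
    SATE = "safety_aware_twin_expert"
    DDGT = "distribution_guided_think"


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical cut-offs; the source gives no numeric values.
    explicit: float = 0.9
    suspicious: float = 0.5


def route(risk_score: float, t: Thresholds = Thresholds()) -> Route:
    """Map an assumed gateway risk score in [0, 1] to a mechanism.

    Assumed policy: very high scores are treated as explicit threats
    and refused outright, mid-range scores go to SATE, and the rest
    are generated under DDGT monitoring.
    """
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must lie in [0, 1]")
    if risk_score >= t.explicit:
        return Route.REFUSE
    if risk_score >= t.suspicious:
        return Route.SATE
    return Route.DDGT
```

Under these assumed thresholds, `route(0.95)` selects the refusal path, `route(0.6)` selects SATE, and `route(0.1)` falls through to DDGT. Refusing early on explicit threats is what lets the cheaper path absorb the easy cases, which is the efficiency argument the abstract makes.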

Key Contributions

Lightweight gateway classifier that performs risk-based routing to dynamically allocate defensive resources across three complementary mechanisms
Safety-Aware Twin Expert (SATE) module that intercepts deceptive attacks disguised as benign queries without penalizing legitimate users
Distribution-Guided Think (DDGT) component that adaptively intervenes during generation under uncertainty to prevent harmful outputs
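To make "intervenes during generation under uncertainty" concrete, one possible trigger is sketched below. The listing does not say how the paper measures uncertainty; using the Shannon entropy of the next-token distribution, the `should_intervene` name, and the threshold of 1.0 nat are all assumptions chosen only for illustration.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in nats of a discrete probability distribution."""
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-6):
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 contribute nothing (limit of p * log p is 0).
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_intervene(probs: Sequence[float], threshold: float = 1.0) -> bool:
    """Hypothetical trigger: intervene when the model is uncertain.

    `threshold` is an illustrative value, not taken from the paper.
    """
    return shannon_entropy(probs) > threshold
```

With this stand-in measure, a uniform distribution over four candidate tokens has entropy ln 4 (about 1.39 nats) and would trigger intervention, while a sharply peaked distribution such as `[0.97, 0.01, 0.01, 0.01]` would not.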

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

inference_time, black_box

Applications

llm safety, chatbot safety

Similar Papers

PISanitizer: Preventing Prompt Injection to Long-Context LLMs via Prompt Sanitization

Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models

Securing AI Agents Against Prompt Injection Attacks

Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis

From static to adaptive: immune memory-based jailbreak detection for large language models

Knowing When Not to Answer: Lightweight KB-Aligned OOD Detection for Safe RAG

Defend LLMs Through Self-Consciousness

Prefix Probing: Lightweight Harmful Content Detection for Large Language Models