defense 2025

SAID: Empowering Large Language Models with Self-Activating Internal Defense

Yulong Chen ¹, Yadong Liu ¹, Jiawen Zhang ¹, Mu Li ¹, Chao Huang ², Jie Wen ¹

¹ Harbin Institute of Technology

² Sun Yat-sen University

0 citations · 43 references · arXiv

Published on arXiv

2510.20129

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

SAID substantially outperforms existing defenses against six jailbreak attacks on five open-source LLMs while preserving benign task performance and incurring minimal computational overhead.

SAID (Self-Activating Internal Defense)

Novel technique introduced

Large Language Models (LLMs), despite advances in safety alignment, remain vulnerable to jailbreak attacks designed to circumvent protective mechanisms. Prevailing defense strategies rely on external interventions, such as input filtering or output modification, which often lack generalizability and compromise model utility while incurring significant computational overhead. In this work, we introduce a new, training-free defense paradigm, Self-Activating Internal Defense (SAID), which reframes the defense task from external correction to internal capability activation. SAID uniquely leverages the LLM's own reasoning abilities to proactively identify and neutralize malicious intent through a three-stage pipeline: model-native intent distillation to extract core semantics, optimal safety prefix probing to activate latent safety awareness, and a conservative aggregation strategy to ensure robust decision-making. Extensive experiments on five open-source LLMs against six advanced jailbreak attacks demonstrate that SAID substantially outperforms state-of-the-art defenses in reducing harmful outputs. Crucially, it achieves this while preserving model performance on benign tasks and incurring minimal computational overhead. Our work establishes that activating the intrinsic safety mechanisms of LLMs is a more robust and scalable path toward building safer and more reliable aligned AI systems.

Key Contributions

SAID: a training-free, three-stage jailbreak defense pipeline (intent distillation → safety prefix probing → conservative aggregation) that activates LLMs' intrinsic safety capabilities without external fine-tuning
Prefix-based Causal Probing: a lightweight intervention that systematically activates latent safety behavior through targeted prefix manipulation, enabling malicious intent detection without degrading benign utility
Empirical demonstration across five open-source LLMs and six advanced jailbreak attacks showing superior robustness over state-of-the-art defenses with negligible computational overhead

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llmtransformer

Threat Tags

inference_timeblack_box

Datasets

AdvBench

Applications

llm safety alignmentjailbreak defenseconversational ai

Read PDF arXiv DOI

SAID: Empowering Large Language Models with Self-Activating Internal Defense

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Bidirectional Intention Inference Enhances LLMs' Defense Against Multi-Turn Jailbreak Attacks

Speculative Safety-Aware Decoding

RedVisor: Reasoning-Aware Prompt Injection Defense via Zero-Copy KV Cache Reuse

Proactive defense against LLM Jailbreak

Sentra-Guard: A Multilingual Human-AI Framework for Real-Time Defense Against Adversarial LLM Jailbreaks

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs

FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents