defense 2025

SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks

Xiangman Li , Xiaodong Wu , Qi Li , Jianbing Ni , Rongxing Lu

Queen’s University

0 citations

Published on arXiv

2508.15182

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

SafeLLM substantially reduces jailbreak attack success rates on Vicuna, LLaMA, and GPT-J while maintaining general-purpose performance, outperforming SFT and DPO baselines in safety guarantees and robustness to unseen attacks.

Novel technique introduced

SafeLLM

Jailbreak attacks pose a serious threat to the safety of Large Language Models (LLMs) by crafting adversarial prompts that bypass alignment mechanisms, causing the models to produce harmful, restricted, or biased content. In this paper, we propose SafeLLM, a novel unlearning-based defense framework that unlearns harmful knowledge from LLMs while preserving linguistic fluency and general capabilities. SafeLLM employs a three-stage pipeline: (1) dynamic unsafe output detection using a hybrid approach that integrates external classifiers with model-internal evaluations; (2) token-level harmful content tracing through feedforward network (FFN) activations to localize harmful knowledge; and (3) constrained optimization to suppress unsafe behavior without degrading overall model quality. SafeLLM achieves targeted and irreversible forgetting by identifying and neutralizing FFN substructures responsible for harmful generation pathways. Extensive experiments on prominent LLMs (Vicuna, LLaMA, and GPT-J) across multiple jailbreak benchmarks show that SafeLLM substantially reduces attack success rates while maintaining high general-purpose performance. Compared to standard defense methods such as supervised fine-tuning and direct preference optimization, SafeLLM offers stronger safety guarantees, more precise control over harmful behavior, and greater robustness to unseen attacks. Moreover, SafeLLM maintains general performance after the harmful knowledge has been unlearned. These results highlight unlearning as a promising direction for scalable and effective LLM safety.
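The first stage of the pipeline, hybrid detection, can be pictured with a small sketch. The abstract does not say how the external classifier and the model-internal evaluation are combined, so the linear blend, the `weight` and `threshold` parameters, and the names `Verdict` and `hybrid_unsafe_score` below are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    score: float   # combined unsafe score in [0, 1]
    unsafe: bool   # True when the score reaches the threshold


def hybrid_unsafe_score(external_score: float,
                        internal_score: float,
                        weight: float = 0.5,
                        threshold: float = 0.5) -> Verdict:
    """Blend an external classifier score with a model-internal score.

    Both inputs are assumed to be probabilities in [0, 1]. The linear
    blend and the threshold are illustrative assumptions, not the
    combination rule used in the paper.
    """
    for name, value in (("external_score", external_score),
                        ("internal_score", internal_score),
                        ("weight", weight)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    score = weight * external_score + (1.0 - weight) * internal_score
    return Verdict(score=score, unsafe=score >= threshold)


if __name__ == "__main__":
    print(hybrid_unsafe_score(0.9, 0.7))   # both signals point to unsafe
    print(hybrid_unsafe_score(0.2, 0.1))   # both signals point to safe
```

In a sketch like this, outputs flagged as unsafe would be the ones passed on to the tracing stage; how the paper actually gates that hand-off is not stated in the abstract.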

Key Contributions

Three-stage pipeline: dynamic unsafe output detection via hybrid external classifiers + model-internal evaluation, token-level harmful content tracing through FFN activations, and constrained adversarial optimization to suppress unsafe behavior.
First token-level unlearning defense against jailbreak attacks, enabling targeted and irreversible forgetting of harmful knowledge substructures within FFN layers.
Demonstrated stronger safety guarantees and robustness to unseen attacks compared to SFT and DPO baselines, while preserving general-purpose model performance on Vicuna, LLaMA, and GPT-J.
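As a rough illustration of token-level tracing through FFN activations, the sketch below ranks neurons by how much more strongly they fire on tokens flagged as harmful than on benign tokens, then zeroes the top-ranked ones. This is a toy stand-in under stated assumptions: the mean-activation gap as a scoring rule, the hard zeroing, and the function names `rank_neurons` and `suppress` are hypothetical, and the paper itself describes suppression via constrained optimization rather than simple masking.

```python
from typing import List, Sequence


def rank_neurons(activations: Sequence[Sequence[float]],
                 harmful: Sequence[bool]) -> List[int]:
    """Rank FFN neurons by how much more they fire on harmful tokens.

    activations[t][j] is the activation of neuron j at token t;
    harmful[t] marks whether token t was flagged as harmful.
    Returns neuron indices, largest harmful-minus-benign gap first.
    """
    if len(activations) != len(harmful):
        raise ValueError("need one harmful flag per token")
    bad = [row for row, h in zip(activations, harmful) if h]
    good = [row for row, h in zip(activations, harmful) if not h]
    if not bad or not good:
        raise ValueError("need at least one harmful and one benign token")
    width = len(activations[0])

    def mean(rows, j):
        return sum(r[j] for r in rows) / len(rows)

    gap = [mean(bad, j) - mean(good, j) for j in range(width)]
    return sorted(range(width), key=lambda j: gap[j], reverse=True)


def suppress(activation_row: Sequence[float],
             neurons: Sequence[int]) -> List[float]:
    """Zero the selected neurons in one token's activation vector."""
    blocked = set(neurons)
    return [0.0 if j in blocked else a for j, a in enumerate(activation_row)]


if __name__ == "__main__":
    # Made-up activations for four tokens and three neurons.
    acts = [
        [0.1, 2.0, 0.3],   # flagged harmful
        [0.2, 1.8, 0.1],   # flagged harmful
        [0.1, 0.1, 0.4],   # benign
        [0.3, 0.0, 0.2],   # benign
    ]
    flags = [True, True, False, False]
    order = rank_neurons(acts, flags)
    print(order)
    print(suppress(acts[0], order[:1]))
```

With a real model, the activation rows would come from the FFN layers themselves (for example via forward hooks), and the suppression step would be replaced by an optimization that also constrains drift on benign inputs.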

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

white_box, inference_time, training_time

Datasets

multiple jailbreak benchmarks (unnamed in excerpt)

Applications

large language model safety, jailbreak defense, harmful content suppression

Similar Papers

Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning

Safety Alignment Should Be Made More Than Just A Few Attention Heads

Unraveling LLM Jailbreaks Through Safety Knowledge Neurons

CRISP: Persistent Concept Unlearning via Sparse Autoencoders

Safety Instincts: LLMs Learn to Trust Their Internal Compass for Self-Defense

Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction

Mitigating Jailbreaks with Intent-Aware LLMs

Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment