DRAGON: Guard LLM Unlearning in Context via Negative Detection and Reasoning
Yaxuan Wang 1,2, Chris Yuhao Liu 1, Quan Liu 2, Jinglong Pang 1, Wei Wei 2, Yujia Bao 2, Yang Liu 1
Published on arXiv: 2511.05784
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
DRAGON achieves strong unlearning across three representative tasks without requiring retain data or base-model fine-tuning, while preserving general language-model utility and supporting continual unlearning
DRAGON
Novel technique introduced
Unlearning in Large Language Models (LLMs) is crucial for protecting private data and removing harmful knowledge. Most existing approaches rely on fine-tuning to balance unlearning efficacy against general language capability, but they typically require training on, or at least access to, retain data, which is often unavailable in real-world scenarios. Although such methods can perform well when both forget and retain data are available, few have demonstrated comparable capability in more practical, data-limited settings. To overcome these limitations, we propose Detect-Reasoning Augmented GeneratiON (DRAGON), a systematic, reasoning-based framework that uses in-context chain-of-thought (CoT) instructions to guard deployed LLMs before inference. Instead of modifying the base model, DRAGON leverages the inherent instruction-following ability of LLMs and introduces a lightweight detection module that identifies forget-worthy prompts without any retain data; flagged prompts are then routed through a dedicated CoT guard model that enforces safe and accurate in-context intervention. To evaluate unlearning robustly, we introduce novel performance metrics and a continual unlearning setting. Extensive experiments across three representative unlearning tasks validate DRAGON's strong unlearning capability, scalability, and practical applicability.
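The guard-before-inference flow described above can be sketched as a simple routing step. This is an illustrative toy, not the paper's implementation: the keyword-matching detector, the CoT instruction text, and all function names here are hypothetical stand-ins (DRAGON's actual detector is a trained scoring model combined with a similarity metric).

```python
# Hypothetical sketch of DRAGON-style inference-time routing.
# All names and the keyword detector are illustrative, not the paper's code.

def detect_forget_worthy(prompt: str, forget_topics: set[str]) -> bool:
    """Toy detector: flag prompts that mention any forget-set topic.
    DRAGON instead scores prompts with a trained model plus a
    similarity-based metric; see the detection-module description."""
    lowered = prompt.lower()
    return any(topic in lowered for topic in forget_topics)

def guard_inference(prompt, base_llm, guard_llm, forget_topics):
    """Route flagged prompts through a CoT guard model; pass the rest
    straight to the deployed model. Base-model weights are untouched."""
    if detect_forget_worthy(prompt, forget_topics):
        cot_instruction = (
            "Reason step by step about whether answering would reveal "
            "forgotten content; if so, refuse or deflect.\n"
        )
        return guard_llm(cot_instruction + prompt)
    return base_llm(prompt)
```

Because the intervention is purely in-context, swapping in a new forget set only changes the detector's inputs, which is what makes the continual-unlearning setting cheap to support.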
Key Contributions
- DRAGON: a training-free, in-context LLM unlearning framework using chain-of-thought instructions that guards deployed LLMs without modifying base model weights or requiring retain data
- A robust detection module combining a trained scoring model with a similarity-based metric into a unified confidence score, enabling adaptive thresholding against distributional shifts and paraphrased adversarial attacks
- Novel evaluation metrics and a continual unlearning benchmark setting for practical data-limited scenarios validated across three representative unlearning tasks
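The detection module's unified confidence score, as described in the second contribution, can be sketched as a convex combination of a trained scorer's probability and a similarity term, with a threshold calibrated on held-out scores. The weighting scheme, the `alpha` parameter, and the quantile-based thresholding below are assumptions for illustration; the paper's exact formulation may differ.

```python
# Illustrative sketch of a unified confidence score: a trained scorer's
# probability blended with max cosine similarity to known forget prompts.
# The blend weight `alpha` and quantile thresholding are assumptions.
import numpy as np

def unified_confidence(scorer_prob: float,
                       prompt_emb: np.ndarray,
                       forget_embs: np.ndarray,
                       alpha: float = 0.5) -> float:
    """Return a confidence in [0, 1] that the prompt is forget-worthy."""
    # Max cosine similarity between the prompt and the forget set.
    sims = forget_embs @ prompt_emb / (
        np.linalg.norm(forget_embs, axis=1) * np.linalg.norm(prompt_emb) + 1e-12
    )
    return alpha * scorer_prob + (1.0 - alpha) * float(sims.max())

def adaptive_threshold(calibration_scores: np.ndarray, q: float = 0.95) -> float:
    """Pick the decision threshold as a quantile of calibration scores,
    so it can be re-estimated when the prompt distribution shifts."""
    return float(np.quantile(calibration_scores, q))
```

Combining a learned score with a distance-based one hedges against each component's failure mode: paraphrased adversarial prompts may fool the scorer but stay embedding-close to the forget set, while novel phrasings far from any stored embedding can still be caught by the scorer.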