Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning
Filip Sondej, Yushi Yang
Published on arXiv: 2509.11816
Tags: Prompt Injection · OWASP LLM Top 10 (LLM01)
Key Finding
CIR achieves over 30x greater reduction in post-attack accuracy vs. the best baseline (Circuit Breakers) while disrupting general performance 30x less, using under 3 GPU-seconds per fact on Llama-3.1-8B.
Technique introduced: Collapse of Irrelevant Representations (CIR)
Abstract
Current unlearning and safety training methods consistently fail to remove dangerous knowledge from language models. We identify the root cause: unlearning targets representations that are too general. We therefore develop a highly selective technique that unlearns robustly while preserving general performance. Our method performs PCA on activations and module-output gradients to identify subspaces containing common representations, then collapses these subspaces before computing unlearning updates, a technique we term Collapse of Irrelevant Representations (CIR). This avoids unlearning general knowledge and targets only representations specific to the facts being unlearned. When unlearning bio- and cyber-hazardous facts from Llama-3.1-8B, we achieve over 30x greater reduction in post-attack accuracy than the best baseline (Circuit Breakers), while disrupting general performance 30x less and using under 3 GPU-seconds per fact. By disentangling harmful and benign capabilities at the level of representations, CIR enables robust and non-disruptive unlearning.
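The "collapse" step can be pictured as projecting out the dominant (shared) principal directions of a batch of activations before any unlearning gradient is computed. The sketch below is a minimal, illustrative analogue using NumPy: the function name, the synthetic data, and the choice of removing the top-k SVD directions are our own assumptions, not the authors' implementation.

```python
import numpy as np

def collapse_common_subspace(X, k):
    """Remove the top-k principal directions from X.

    X: (n_samples, d) matrix of activations (or module-output gradients).
    Returns X with its k most common directions projected out -- a toy
    stand-in for CIR's collapse step, so that what remains is the part
    of each activation not explained by broadly shared representations.
    """
    Xc = X - X.mean(axis=0)                 # center before PCA
    # Right singular vectors of the centered data are the PCA directions
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:k]                              # top-k common directions, shape (k, d)
    # Subtract each row's component lying in span(V)
    return X - (X @ V.T) @ V

# Toy usage: activations dominated by one direction shared across samples
rng = np.random.default_rng(0)
shared = rng.normal(size=(1, 8))            # a "general" representation direction
X = rng.normal(size=(32, 8)) * 0.1 + rng.normal(size=(32, 1)) * shared
X_collapsed = collapse_common_subspace(X, k=1)
```

After the collapse, almost no variance remains along the shared direction, so a gradient update computed from `X_collapsed` would be driven by the sample-specific residuals instead of the common subspace.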
Key Contributions
- Identifies the root cause of unlearning failure: naive unlearning disrupts general representations shared between harmful and benign capabilities, making it trivially reversible by fine-tuning attacks once general performance degrades by as little as 0.1%.
- Proposes CIR (Collapse of Irrelevant Representations), which uses PCA on activations and module-output gradients to identify and collapse common subspaces before computing unlearning updates, targeting only harmful-specific representations.
- Introduces an MLP breaking loss that directly targets MLP outputs before they are added to the residual stream, improving unlearning selectivity by 40% over prior representation-engineering approaches that target the residual stream itself.
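The last contribution hinges on where the loss is attached: on the MLP branch's output before the residual addition, not on the residual stream after it. A PyTorch forward hook is one natural way to capture that intermediate tensor. The sketch below is only a conceptual illustration; the `TinyBlock` module and the squared-norm penalty are stand-ins we chose, since the paper's exact loss form is not given in this summary.

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Toy transformer-style block: residual stream plus an MLP branch."""
    def __init__(self, d=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return x + self.mlp(x)  # MLP output is added onto the residual stream

block = TinyBlock()
captured = {}
# Hook the MLP so we see its output *before* the residual addition
block.mlp.register_forward_hook(lambda mod, inp, out: captured.update(out=out))

x = torch.randn(4, 16)          # stand-in for activations on harmful prompts
_ = block(x)
# "MLP breaking"-style loss: penalize the MLP's own contribution directly,
# an illustrative stand-in for the paper's loss on pre-residual MLP outputs
breaking_loss = captured["out"].pow(2).mean()
breaking_loss.backward()        # gradients flow only through the MLP branch's output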