Reliable Unlearning Harmful Information in LLMs with Metamorphosis Representation Projection
Chengcan Wu, Zeming Wei, Huanran Chen, Yinpeng Dong, Meng Sun
Published on arXiv (arXiv:2508.15449)
Transfer Learning Attack
OWASP ML Top 10 — ML07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
MRP achieves SOTA unlearning effectiveness using irreversible hidden-state projections, successfully defending against relearning attacks while preserving benign model utility.
Metamorphosis Representation Projection (MRP)
Novel technique introduced
While Large Language Models (LLMs) have demonstrated impressive performance across various domains and tasks, concerns about their safety are becoming increasingly severe. In particular, since models may store unsafe knowledge internally, machine unlearning has emerged as a representative paradigm for ensuring model safety. Existing approaches employ various training techniques, such as gradient ascent and negative preference optimization, in an attempt to eliminate the influence of undesired data on target models. However, these methods merely suppress the activation of undesired data through parametric training without completely eradicating its informational traces within the model. This fundamental limitation makes effective continuous unlearning difficult and renders these methods vulnerable to relearning attacks. To overcome these challenges, we propose a Metamorphosis Representation Projection (MRP) approach that pioneers the application of irreversible projection properties to machine unlearning. By implementing projective transformations in the hidden-state space of specific network layers, our method effectively eliminates harmful information while preserving useful knowledge. Experimental results demonstrate that our approach enables effective continuous unlearning and successfully defends against relearning attacks, achieving state-of-the-art performance in unlearning effectiveness while preserving natural performance. Our code is available at https://github.com/ChengcanWu/MRP.
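The core idea of an irreversible hidden-state projection can be illustrated with a toy sketch. All names, dimensions, and the construction of the projector below are assumptions for illustration; the paper's actual layer selection and training procedure differ.

```python
import numpy as np

# Toy sketch (hypothetical construction, not the paper's implementation):
# project hidden states onto the orthogonal complement of a "harmful"
# subspace. The projector P = I - U U^T is idempotent and rank-deficient,
# so the removed component cannot be reconstructed from projected states.

d = 8   # hidden-state dimension (toy size)
k = 2   # dimension of the subspace to eliminate

rng = np.random.default_rng(0)
# U: orthonormal basis for the (assumed) harmful subspace
U, _ = np.linalg.qr(rng.standard_normal((d, k)))

P = np.eye(d) - U @ U.T      # projector onto the complement of span(U)

h = rng.standard_normal(d)   # a hidden state
h_proj = P @ h               # harmful component removed

# Idempotence: applying the projection again changes nothing
assert np.allclose(P @ h_proj, h_proj)
# Irreversibility: P is singular (rank d - k), so it has no inverse
assert np.linalg.matrix_rank(P) == d - k
```

Because P has rank d - k, the map is many-to-one: the component of h lying in span(U) is lost, which is what distinguishes projection-based elimination from training methods that only suppress activations.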
Key Contributions
- Metamorphosis Representation Projection (MRP): applies irreversible projective transformations in the hidden state space of specific LLM layers to permanently eliminate harmful information rather than merely suppressing it
- Demonstrates effective continuous unlearning that successfully resists relearning attacks — adversarial fine-tuning cannot recover the eliminated harmful knowledge
- Achieves state-of-the-art unlearning effectiveness while preserving the model's natural performance on benign tasks
🛡️ Threat Analysis
The core adversarial threat the paper defends against is the "relearning attack," in which an adversary fine-tunes an unlearned model on a small amount of harmful data to recover suppressed harmful capabilities. This directly exploits the fine-tuning/transfer-learning process to resurrect hidden behavior, placing it squarely within ML07's scope. MRP counters it by making the projective transformations irreversible, so fine-tuning cannot recover the eliminated information.
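Why a relearning attack has nothing to recover can be shown with a minimal sketch, assuming the same toy projector construction as above (names and shapes are illustrative, not the paper's implementation): two hidden states that differ only within the eliminated subspace become identical after projection, so downstream layers, and any gradients computed through them during adversarial fine-tuning, receive the same signal for both.

```python
import numpy as np

# Illustrative sketch (hypothetical subspace and dimensions): after an
# irreversible projection, states differing only in the removed subspace
# collapse to the same point, leaving nothing for fine-tuning to relearn.
d, k = 8, 2
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((d, k)))  # assumed harmful subspace
P = np.eye(d) - U @ U.T                           # rank-deficient projector

h1 = rng.standard_normal(d)
h2 = h1 + U @ rng.standard_normal(k)  # differs from h1 only inside span(U)

# Indistinguishable after projection: the removed distinction is gone,
# so no amount of fine-tuning on top of P can reintroduce it from h_proj.
assert np.allclose(P @ h1, P @ h2)
```

This collapse is the intuition behind the paper's claim that MRP resists relearning: suppression leaves the information in the weights for fine-tuning to reactivate, whereas projection discards it from the representation entirely.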