Representation-Guided Parameter-Efficient LLM Unlearning

Large Language Models (LLMs) often memorize sensitive or harmful information, necessitating effective machine unlearning techniques. While existing parameter-efficient unlearning methods have shown promise, they still struggle with the forget-retain trade-off. This can be attributed to their reliance on parameter importance metrics to identify parameters that are important exclusively for the forget set, which is fundamentally limited by the superposition phenomenon. Due to the polysemantic nature of LLM parameters, such an importance metric may struggle to disentangle parameters associated with the forget and retain sets. In this work, we propose Representation-Guided Low-rank Unlearning (REGLU), a novel approach that leverages the geometric properties of representation spaces to achieve robust and precise unlearning. First, we develop a representation-guided initialization for LoRA that identifies the optimal subspace for selective forgetting. Second, we introduce a regularization loss that constrains the outputs of the LoRA update to lie in the orthogonal complement of the retain set's representation subspace, thereby minimizing interference with the model's performance on the retain set. We evaluate REGLU on the TOFU and WMDP benchmarks across multiple models. Our results demonstrate that REGLU consistently outperforms state-of-the-art baselines, achieving superior unlearning quality while maintaining higher model utility.
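
The orthogonality constraint described in the abstract can be illustrated with a small NumPy sketch. This is an illustration under assumptions, not the authors' implementation: the retain subspace is estimated here from the top-k right singular vectors of uncentered retain-set activations, and the names `retain_basis` and `orthogonal_loss`, as well as the rank `k`, are hypothetical.

```python
import numpy as np


def retain_basis(retain_reps: np.ndarray, k: int) -> np.ndarray:
    """Return an orthonormal basis (d, k) for the dominant retain subspace.

    retain_reps has shape (n_samples, d). Using the top-k right singular
    vectors is an assumption made for this sketch.
    """
    _, _, vt = np.linalg.svd(retain_reps, full_matrices=False)
    return vt[:k].T


def orthogonal_loss(delta_out: np.ndarray, basis: np.ndarray) -> float:
    """Penalize the part of the LoRA update's output that lies in span(basis).

    delta_out has shape (n_samples, d) and stands for the outputs of the
    low-rank update. The loss is zero exactly when every row lies in the
    orthogonal complement of the retain subspace.
    """
    coeffs = delta_out @ basis  # coordinates inside the retain subspace
    return float(np.mean(np.sum(coeffs ** 2, axis=1)))
```

In a training loop, a term like this would be added to the forgetting objective with a weighting coefficient; how REGLU weights it is not stated in this summary.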

Key Contributions

RILA: a representation-guided LoRA initialization that identifies the optimal subspace for selective forgetting by maximizing forget-set variance while minimizing retain-set variance
ROL: a Representation Orthogonal Loss that constrains the outputs of the LoRA update to lie in the orthogonal complement of the retain set's representation subspace
State-of-the-art performance on the TOFU and WMDP benchmarks, with a better forget-retain trade-off than parameter-importance-based methods
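
One way to read the RILA criterion (maximize forget-set variance while minimizing retain-set variance) is as a generalized eigenvalue problem between the two covariance matrices. The sketch below follows that reading. It is an assumption for illustration, since this summary does not give the paper's exact objective; `forget_dominant_directions`, `init_lora`, the ridge term `eps`, and the choice to place the directions in the LoRA down-projection with a zero up-projection are all hypothetical.

```python
import numpy as np


def forget_dominant_directions(forget_reps, retain_reps, r, eps=1e-3):
    """Return (d, r) unit vectors v with a large ratio
    (v^T C_f v) / (v^T (C_r + eps I) v), where C_f and C_r are the
    uncentered second-moment matrices of forget and retain representations.
    """
    d = forget_reps.shape[1]
    cov_f = forget_reps.T @ forget_reps / len(forget_reps)
    cov_r = retain_reps.T @ retain_reps / len(retain_reps) + eps * np.eye(d)

    # Whiten with cov_r^{-1/2} so the generalized problem becomes a
    # standard symmetric eigenproblem.
    evals, evecs = np.linalg.eigh(cov_r)
    whiten = evecs @ np.diag(evals ** -0.5) @ evecs.T
    _, u = np.linalg.eigh(whiten @ cov_f @ whiten)  # ascending eigenvalues

    top = u[:, ::-1][:, :r]          # r largest ratios first
    dirs = whiten @ top              # map back to the original space
    return dirs / np.linalg.norm(dirs, axis=0, keepdims=True)


def init_lora(dirs, d_out):
    """Hypothetical LoRA init: A reads from the selected directions,
    B starts at zero so the initial update B @ A is zero."""
    a = dirs.T.copy()                        # shape (r, d_in)
    b = np.zeros((d_out, dirs.shape[1]))     # shape (d_out, r)
    return a, b
```

Starting with a zero up-projection keeps the model unchanged at step zero, which matches standard LoRA practice; whether REGLU does the same is not confirmed by this summary.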

🛡️ Threat Analysis

Model Inversion Attack

The paper addresses machine unlearning for LLMs to prevent extraction of memorized training data. TOFU tests whether a model still reveals facts from a designated forget set, and WMDP tests whether it retains hazardous knowledge, so both measure how much sensitive information remains recoverable after unlearning. This makes the method a defense against data reconstruction and extraction attacks. The core threat model involves an adversary extracting sensitive or harmful information that the model has memorized.

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

training_time, black_box

Datasets

TOFU, WMDP

Applications

2026, 0 citations
