Published on arXiv

2604.17396

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

REGLU achieves state-of-the-art unlearning quality while preserving more model utility than prior baselines on Llama-2-7B, Phi-1.5, and Zephyr-7B-beta, by using representation geometry rather than parameter-importance metrics.

REGLU

Novel technique introduced


Large Language Models (LLMs) often memorize sensitive or harmful information, necessitating effective machine unlearning techniques. While existing parameter-efficient unlearning methods have shown promise, they still struggle with the forget-retain trade-off. This can be attributed to their reliance on parameter importance metrics to identify parameters that are important exclusively for the forget set, which is fundamentally limited by the superposition phenomenon. Due to the polysemantic nature of LLM parameters, such an importance metric may struggle to disentangle parameters associated with the forget and retain sets. In this work, we propose Representation-Guided Low-rank Unlearning (REGLU), a novel approach that leverages the geometric properties of representation spaces to achieve robust and precise unlearning. First, we develop a representation-guided initialization for LoRA that identifies the optimal subspace for selective forgetting. Second, we introduce a regularization loss that constrains the outputs of the LoRA update to lie in the orthogonal complement of the retain set's representation subspace, thereby minimizing interference with the model's performance on the retain set. We evaluate REGLU on the TOFU and WMDP benchmarks across multiple models. Our results demonstrate that REGLU consistently outperforms state-of-the-art baselines, achieving superior unlearning quality while maintaining higher model utility.
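The second idea in the abstract, keeping the LoRA update's outputs in the orthogonal complement of the retain set's representation subspace, can be illustrated with a small penalty term. The following is a minimal NumPy sketch under assumptions that are not stated in the source: the function names `retain_basis` and `orthogonal_penalty` are hypothetical, the retain subspace is estimated with an uncentered SVD truncated to a chosen `rank`, and the LoRA outputs are assumed to share the same feature dimension as the retain representations. The paper's actual loss may be defined differently.

```python
import numpy as np


def retain_basis(retain_reps: np.ndarray, rank: int) -> np.ndarray:
    """Return an orthonormal basis (d x rank) for the dominant directions
    of the retain-set representations (rows = samples, columns = features).

    Assumption: the subspace is taken from an uncentered, truncated SVD.
    """
    _, _, vt = np.linalg.svd(retain_reps, full_matrices=False)
    return vt[:rank].T


def orthogonal_penalty(lora_out: np.ndarray, basis: np.ndarray) -> float:
    """Mean squared norm of the part of each LoRA output that lies INSIDE
    the retain subspace. It is zero exactly when every output is orthogonal
    to that subspace, which is the constraint described in the abstract.
    """
    inside = lora_out @ basis  # coordinates within the retain subspace
    return float(np.mean(np.sum(inside ** 2, axis=1)))


# Toy retain representations that only occupy the first two axes of R^4.
retain = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0, 0.0]])
basis = retain_basis(retain, rank=2)

# An update along the third axis leaves the retain subspace untouched,
# while an update along the first axis is penalized.
harmless = orthogonal_penalty(np.array([[0.0, 0.0, 3.0, 0.0]]), basis)
harmful = orthogonal_penalty(np.array([[2.0, 0.0, 0.0, 0.0]]), basis)
```

In a training loop, a term like this would presumably be added to the forgetting objective with a weighting coefficient; that coefficient and the choice of layer are not specified in this summary.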


Key Contributions

  • RILA: a representation-guided LoRA initialization that identifies the optimal subspace for selective forgetting by maximizing forget-set variance while minimizing retain-set variance
  • ROL: a Representation Orthogonal Loss that constrains LoRA updates to lie in the orthogonal complement of the retain set's representation subspace
  • State-of-the-art performance on the TOFU and WMDP benchmarks, with a better forget-retain trade-off than parameter-importance-based methods
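The RILA bullet describes choosing a subspace with high forget-set variance and low retain-set variance. One plausible way to make that concrete is a generalized eigenvalue (Rayleigh quotient) problem; the sketch below takes that reading. It is an assumption, not the paper's stated procedure: the function name `forget_specific_directions`, the ridge term `eps`, and the use of uncentered second-moment matrices are all illustrative choices.

```python
import numpy as np


def forget_specific_directions(forget_reps, retain_reps, rank, eps=1e-3):
    """Return `rank` unit directions v that maximize
    (v^T C_f v) / (v^T (C_r + eps*I) v), plus the achieved ratios.

    C_f and C_r are uncentered second-moment matrices of the forget and
    retain representations (rows = samples, columns = features).
    """
    c_f = forget_reps.T @ forget_reps / len(forget_reps)
    c_r = retain_reps.T @ retain_reps / len(retain_reps)
    d = c_r.shape[0]

    # Whitening transform W = (C_r + eps*I)^(-1/2); eps keeps it invertible.
    evals, evecs = np.linalg.eigh(c_r + eps * np.eye(d))
    w = evecs @ np.diag(evals ** -0.5) @ evecs.T

    # In whitened coordinates the problem is an ordinary symmetric
    # eigenproblem; the largest eigenvalues give the largest ratios.
    ratios, u = np.linalg.eigh(w @ c_f @ w)
    top = np.argsort(ratios)[::-1][:rank]

    # Map back to the original coordinates and normalize each column.
    v = w @ u[:, top]
    v = v / np.linalg.norm(v, axis=0, keepdims=True)
    return v, ratios[top]


# Toy data: the forget set varies along axes 0 and 2, the retain set
# along axes 0 and 1, so axis 2 is the only forget-specific direction.
forget = np.array([[1.0, 0.0, 1.0], [-1.0, 0.0, 1.0],
                   [1.0, 0.0, -1.0], [-1.0, 0.0, -1.0]])
retain = np.array([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0],
                   [-2.0, 0.0, 0.0], [0.0, -2.0, 0.0]])
directions, ratios = forget_specific_directions(forget, retain, rank=1)
```

On this toy input the selected direction aligns with axis 2, the one axis the retain set never uses. Directions like these could seed a LoRA adapter's low-rank factors, though how REGLU maps them onto the adapter matrices is not described in this summary.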

🛡️ Threat Analysis

Model Inversion Attack

The paper addresses machine unlearning for LLMs as a way to prevent extraction of memorized training data. TOFU measures whether a model still reveals facts about fictitious authors it was fine-tuned on, and WMDP measures retained hazardous knowledge, so both benchmarks test whether supposedly forgotten content can still be elicited from the model. This makes REGLU a defense against data reconstruction and extraction attacks. The core threat model is an adversary extracting sensitive or harmful information that the model has memorized.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, black_box
Datasets
TOFU, WMDP
Applications
llm unlearning, privacy protection, harmful content removal