$\textbf{AGT$^{AO}$}$: Robust and Stabilized LLM Unlearning via Adversarial Gating Training with Adaptive Orthogonality
Pengyu Li 1,2, Lingling Zhang 1,3, Zhitao Gao 1,2, Yanrui Wu 1,2, Yuxuan Dong 1,2, Huan Liu 1, Bifan Wei 1,3, Jun Liu 1,3
Published on arXiv: 2602.01703
Model Inversion Attack
OWASP ML Top 10 — ML03
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
AGT^AO achieves a KUR of approximately 0.01 (near-complete erasure) while retaining an MMLU score of 58.30, outperforming existing unlearning methods on the efficacy-utility trade-off.
AGT^AO (Adversarial Gating Training with Adaptive Orthogonality)
Novel technique introduced
While Large Language Models (LLMs) have achieved remarkable capabilities, they unintentionally memorize sensitive data, posing critical privacy and security risks. Machine unlearning is pivotal for mitigating these risks, yet existing paradigms face a fundamental dilemma: aggressive unlearning often induces catastrophic forgetting that degrades model utility, whereas conservative strategies risk superficial forgetting, leaving models vulnerable to adversarial recovery. To address this trade-off, we propose $\textbf{AGT$^{AO}$}$ (Adversarial Gating Training with Adaptive Orthogonality), a unified framework designed to reconcile robust erasure with utility preservation. Specifically, our approach introduces $\textbf{Adaptive Orthogonality (AO)}$ to dynamically mitigate geometric gradient conflicts between forgetting and retention objectives, thereby minimizing unintended knowledge degradation. Concurrently, $\textbf{Adversarial Gating Training (AGT)}$ formulates unlearning as a latent-space min-max game, employing a curriculum-based gating mechanism to simulate and counter internal recovery attempts. Extensive experiments demonstrate that $\textbf{AGT$^{AO}$}$ achieves a superior trade-off between unlearning efficacy (KUR $\approx$ 0.01) and model utility (MMLU 58.30). Code is available at https://github.com/TiezMind/AGT-unlearning.
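The paper does not spell out the Adaptive Orthogonality rule in this abstract, but the described behavior (detecting a geometric conflict between the forgetting and retention gradients and removing the conflicting component) can be sketched in the spirit of gradient-projection methods. The function name, the cosine-similarity conflict test, and the threshold `tau` below are illustrative assumptions, not the paper's exact formulation:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def adaptive_orthogonal_update(g_forget, g_retain, tau=0.0):
    """Hypothetical sketch of an AO-style update.

    If the forgetting gradient conflicts with the retention gradient
    (cosine similarity below the threshold tau), project out the
    component of g_forget along g_retain so the unlearning step does
    not push against retained knowledge.
    """
    norm_f = math.sqrt(dot(g_forget, g_forget))
    norm_r = math.sqrt(dot(g_retain, g_retain))
    cos = dot(g_forget, g_retain) / (norm_f * norm_r + 1e-12)
    if cos < tau:
        # Remove the conflicting component: g_forget becomes orthogonal
        # to g_retain, leaving retention-relevant directions untouched.
        coef = dot(g_forget, g_retain) / (dot(g_retain, g_retain) + 1e-12)
        g_forget = [f - coef * r for f, r in zip(g_forget, g_retain)]
    return g_forget
```

After projection, the update direction is orthogonal to the retention gradient, which is one simple way to "minimize unintended knowledge degradation"; the adaptive element in the paper presumably tunes when and how strongly this projection applies.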
Key Contributions
- Adaptive Orthogonality (AO) component that dynamically resolves geometric gradient conflicts between forgetting and retention objectives to preserve model utility
- Adversarial Gating Training (AGT) that formulates unlearning as a latent-space min-max game with a curriculum gating mechanism to simulate and counter internal adversarial knowledge recovery
- Unified AGT^AO framework achieving KUR ≈ 0.01 unlearning efficacy while maintaining MMLU 58.30 general utility
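The AGT component above frames unlearning as a latent-space min-max game with a curriculum-controlled gate. As a toy illustration only (scalar "latent knowledge" `z`, scalar adversary weight `w`, and a linear gate ramp are all assumptions; the paper's actual gating mechanism and objectives are not reproduced), the alternating ascent/descent dynamic might be sketched as:

```python
def agt_minmax_step(z, w, lr=0.1, gate=1.0):
    """One alternating min-max step on a toy recovery signal r = gate*w*z.

    The adversary (w) ascends to maximize the recoverable signal r^2
    minus a small weight penalty; the unlearner (z) then descends to
    suppress whatever the strengthened adversary can still recover.
    """
    r = gate * w * z
    grad_w = 2 * r * gate * z - 0.1 * w   # adversary: maximize r^2 - 0.05*w^2
    w = w + lr * grad_w
    r = gate * w * z
    grad_z = 2 * r * gate * w             # unlearner: minimize r^2
    z = z - lr * grad_z
    return z, w

def train(z0=1.0, w0=0.5, epochs=50):
    z, w = z0, w0
    for t in range(epochs):
        # Curriculum: the gate (adversary strength) ramps up over the
        # first 10 epochs, so early unlearning faces a weak adversary.
        gate = min(1.0, (t + 1) / 10)
        z, w = agt_minmax_step(z, w, gate=gate)
    return z, w
```

Running `train()` drives `z` toward zero: the latent knowledge is erased to the point where even the fully ramped-up adversary recovers almost nothing, which mirrors the intent of simulating and countering internal recovery attempts.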
🛡️ Threat Analysis
The core threat model is adversarial recovery of memorized training data from LLMs after unlearning. AGT explicitly simulates adversaries attempting to re-extract erased knowledge, and KUR measures resistance to such extraction, making this a defense against training-data reconstruction.