$\textbf{AGT$^{AO}$}$: Robust and Stabilized LLM Unlearning via Adversarial Gating Training with Adaptive Orthogonality
Pengyu Li 1,2, Lingling Zhang 1,3, Zhitao Gao 1,2, Yanrui Wu 1,2, Yuxuan Dong 1,2, Huan Liu 1, Bifan Wei 1,3, Jun Liu 1,3
Published on arXiv: 2602.01703
Model Inversion Attack
OWASP ML Top 10 — ML03
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
AGT^AO achieves a KUR of approximately 0.01 (near-complete erasure) while retaining an MMLU score of 58.30, outperforming existing unlearning methods on the efficacy-utility trade-off.
AGT^AO (Adversarial Gating Training with Adaptive Orthogonality)
Novel technique introduced
While Large Language Models (LLMs) have achieved remarkable capabilities, they unintentionally memorize sensitive data, posing critical privacy and security risks. Machine unlearning is pivotal for mitigating these risks, yet existing paradigms face a fundamental dilemma: aggressive unlearning often induces catastrophic forgetting that degrades model utility, whereas conservative strategies risk superficial forgetting, leaving models vulnerable to adversarial recovery. To address this trade-off, we propose $\textbf{AGT$^{AO}$}$ (Adversarial Gating Training with Adaptive Orthogonality), a unified framework designed to reconcile robust erasure with utility preservation. Specifically, our approach introduces $\textbf{Adaptive Orthogonality (AO)}$ to dynamically mitigate geometric gradient conflicts between forgetting and retention objectives, thereby minimizing unintended knowledge degradation. Concurrently, $\textbf{Adversarial Gating Training (AGT)}$ formulates unlearning as a latent-space min-max game, employing a curriculum-based gating mechanism to simulate and counter internal recovery attempts. Extensive experiments demonstrate that $\textbf{AGT$^{AO}$}$ achieves a superior trade-off between unlearning efficacy (KUR $\approx$ 0.01) and model utility (MMLU 58.30). Code is available at https://github.com/TiezMind/AGT-unlearning.
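The paper does not spell out the Adaptive Orthogonality rule in this abstract, but the described behavior (detecting a geometric conflict between the forgetting and retention gradients and removing the conflicting component) can be sketched in the spirit of gradient-projection methods. The function name, the cosine-similarity conflict test, and the threshold `tau` below are illustrative assumptions, not the paper's exact formulation:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def adaptive_orthogonal_update(g_forget, g_retain, tau=0.0):
    """Hypothetical sketch of an AO-style update.

    If the forgetting gradient conflicts with the retention gradient
    (cosine similarity below the threshold tau), project out the
    component of g_forget along g_retain so the unlearning step does
    not push against retained knowledge.
    """
    norm_f = math.sqrt(dot(g_forget, g_forget))
    norm_r = math.sqrt(dot(g_retain, g_retain))
    cos = dot(g_forget, g_retain) / (norm_f * norm_r + 1e-12)
    if cos < tau:
        # Remove the conflicting component: g_forget becomes orthogonal
        # to g_retain, leaving retention-relevant directions untouched.
        coef = dot(g_forget, g_retain) / (dot(g_retain, g_retain) + 1e-12)
        g_forget = [f - coef * r for f, r in zip(g_forget, g_retain)]
    return g_forget
```

After projection, the update direction is orthogonal to the retention gradient, which is one simple way to "minimize unintended knowledge degradation"; the adaptive element in the paper presumably tunes when and how strongly this projection applies.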
Key Contributions
- Adaptive Orthogonality (AO) component that dynamically resolves geometric gradient conflicts between forgetting and retention objectives to preserve model utility
- Adversarial Gating Training (AGT) that formulates unlearning as a latent-space min-max game with a curriculum gating mechanism to simulate and counter internal adversarial knowledge recovery
- Unified AGT^AO framework achieving KUR ≈ 0.01 unlearning efficacy while maintaining MMLU 58.30 general utility
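The AGT component above frames unlearning as a latent-space min-max game with a curriculum-controlled gate. As a toy illustration only (scalar "latent knowledge" `z`, scalar adversary weight `w`, and a linear gate ramp are all assumptions; the paper's actual gating mechanism and objectives are not reproduced), the alternating ascent/descent dynamic might be sketched as:

```python
def agt_minmax_step(z, w, lr=0.1, gate=1.0):
    """One alternating min-max step on a toy recovery signal r = gate*w*z.

    The adversary (w) ascends to maximize the recoverable signal r^2
    minus a small weight penalty; the unlearner (z) then descends to
    suppress whatever the strengthened adversary can still recover.
    """
    r = gate * w * z
    grad_w = 2 * r * gate * z - 0.1 * w   # adversary: maximize r^2 - 0.05*w^2
    w = w + lr * grad_w
    r = gate * w * z
    grad_z = 2 * r * gate * w             # unlearner: minimize r^2
    z = z - lr * grad_z
    return z, w

def train(z0=1.0, w0=0.5, epochs=50):
    z, w = z0, w0
    for t in range(epochs):
        # Curriculum: the gate (adversary strength) ramps up over the
        # first 10 epochs, so early unlearning faces a weak adversary.
        gate = min(1.0, (t + 1) / 10)
        z, w = agt_minmax_step(z, w, gate=gate)
    return z, w
```

Running `train()` drives `z` toward zero: the latent knowledge is erased to the point where even the fully ramped-up adversary recovers almost nothing, which mirrors the intent of simulating and countering internal recovery attempts.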
🛡️ Threat Analysis
The core threat model is adversarial recovery of memorized training data from LLMs after unlearning. AGT explicitly simulates adversaries attempting to re-extract erased knowledge, and KUR measures resistance to such extraction, making this a defense against training-data reconstruction.