Attack (2026)

Towards Unveiling Vulnerabilities of Large Reasoning Models in Machine Unlearning

Aobo Chen, Chenxu Zhao, Chenglin Miao, Mengdi Huai

Published on arXiv: 2604.04255

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Forces incorrect final answers while generating convincing but misleading reasoning traces in LRM unlearning scenarios, exposing a new attack surface in unlearning pipelines

Bi-level Exact Unlearning Attack

Novel technique introduced


Large language models (LLMs) possess strong semantic understanding, driving significant progress in data mining applications. This capability is further enhanced by large reasoning models (LRMs), which produce explicit multi-step reasoning traces. Meanwhile, the growing demand for the right to be forgotten has driven the development of machine unlearning techniques, which aim to eliminate the influence of specific data from trained models without full retraining. However, unlearning can also introduce new security vulnerabilities by exposing additional interaction surfaces. Although many studies have investigated unlearning attacks, none has targeted LRMs. To bridge this gap, we propose the first LRM unlearning attack, which forces incorrect final answers while generating convincing but misleading reasoning traces. This objective is challenging due to non-differentiable logical constraints, weak optimization signal over long rationales, and discrete forget-set selection. To overcome these challenges, we introduce a bi-level exact unlearning attack that incorporates a differentiable objective function, influential token alignment, and a relaxed indicator strategy. To demonstrate the effectiveness and generalizability of our attack, we also design novel optimization frameworks and conduct comprehensive experiments in both white-box and black-box settings, aiming to raise awareness of emerging threats to LRM unlearning pipelines.
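The bi-level structure described above can be illustrated with a deliberately simplified toy, not the paper's method: the outer level chooses which unlearning request to submit, the inner level simulates an unlearning update (here, gradient ascent on the forget set's loss for a logistic-regression stand-in model), and the outer objective checks whether a target query is pushed toward the wrong answer. All data, the model, the candidate forget sets, and the update rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(X, y, steps=500, lr=0.1):
    # Plain logistic regression by gradient descent (toy victim model).
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def unlearn(w, X_f, y_f, steps=50, lr=0.1):
    # Simulated unlearning: gradient *ascent* on the forget set's loss,
    # a common approximate-unlearning heuristic (an assumption here).
    w = w.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X_f @ w))
        w += lr * X_f.T @ (p - y_f) / len(y_f)
    return w

# Synthetic, linearly separable training data.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w0 = train(X, y)

x_target = np.array([1.0, 1.0])   # true label is 1

# Outer level: among candidate forget sets, pick the request whose
# unlearning pushes the target query furthest toward the wrong answer.
candidates = [np.arange(0, 10), np.arange(10, 20), np.where(y == 1)[0][:10]]
best = None
for idx in candidates:
    w1 = unlearn(w0, X[idx], y[idx])
    score = -(x_target @ w1)      # higher = closer to the wrong answer
    if best is None or score > best[0]:
        best = (score, idx)

w_atk = unlearn(w0, X[best[1]], y[best[1]])
print(x_target @ w0, x_target @ w_atk)  # attacked margin is smaller
```

The real attack replaces this brute-force outer search with the paper's differentiable objective and relaxed indicator, but the nesting, an outer adversarial choice wrapped around an inner unlearning update, is the same.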


Key Contributions

  • First attack specifically targeting machine unlearning in large reasoning models (LRMs)
  • Bi-level exact unlearning attack framework with differentiable objective, influential token alignment, and relaxed indicator strategy
  • Comprehensive evaluation in both white-box and black-box settings demonstrating vulnerabilities in LRM unlearning pipelines
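The "relaxed indicator strategy" named above addresses the discrete forget-set selection problem: choosing a 0/1 indicator over candidate samples is non-differentiable. A generic sketch of that idea, not the paper's exact formulation, replaces the indicator with a sigmoid of continuous logits and optimizes them by gradient ascent; the per-sample "attack utility" scores and the budget penalty below are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical per-sample attack-utility scores (stand-ins for how much
# requesting each sample for unlearning would help the attacker).
utilities = np.array([0.10, 0.90, 0.20, 0.80, 0.05])
budget = 2          # desired forget-set size
lam = 1.0           # penalty weight for deviating from the budget
theta = np.zeros_like(utilities)  # logits of the relaxed indicator

for _ in range(200):
    s = sigmoid(theta)
    # Gradient in s of: utilities . s - lam * (sum(s) - budget)^2
    grad_s = utilities - 2.0 * lam * (s.sum() - budget)
    theta += 0.5 * grad_s * s * (1.0 - s)   # chain rule through sigmoid

# Round the relaxed weights back to a discrete forget-set choice.
selected = sorted(np.argsort(-sigmoid(theta))[:budget].tolist())
print(selected)
```

Because the relaxation is smooth, the same gradients can flow through the rest of a differentiable attack objective; the final rounding step recovers a concrete forget set to submit as an unlearning request.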

🛡️ Threat Analysis

Model Inversion Attack

The attack exploits the unlearning process to manipulate model behavior in a way that could reveal or corrupt what should have been 'forgotten'; it is an attack on a privacy-preserving mechanism (unlearning). The paper frames this as a security vulnerability in unlearning pipelines: an adversary forces incorrect outputs that could leak information about the forget set or undermine the unlearning guarantee.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, black_box, training_time
Applications
machine unlearning, reasoning models, question answering