Defense · 2026

STaR: Sensitive Trajectory Regulation for Unlearning in Large Reasoning Models

Jingjing Zhou 1,2, Gaoxiang Cong 2,1, Li Su 1,2, Liang Li 1,2

0 citations · 38 references · arXiv

Published on arXiv · 2601.09281

Membership Inference Attack (OWASP ML Top 10 — ML04)

Sensitive Information Disclosure (OWASP LLM Top 10 — LLM06)

Key Finding

STaR achieves comprehensive and stable unlearning across full reasoning trajectories with minimal utility loss, outperforming existing LLM unlearning methods that only suppress final answers.

STaR (Sensitive Trajectory Regulation)

Novel technique introduced


Large Reasoning Models (LRMs) have advanced automated multi-step reasoning, but their ability to generate complex Chain-of-Thought (CoT) trajectories introduces severe privacy risks, as sensitive information may be deeply embedded throughout the reasoning process. Existing Large Language Model (LLM) unlearning approaches typically modify only final answers and are therefore insufficient for LRMs: they fail to remove sensitive content from intermediate steps, leading to persistent privacy leakage and degraded security. To address these challenges, we propose Sensitive Trajectory Regulation (STaR), a parameter-free, inference-time unlearning framework that achieves robust privacy protection throughout the reasoning process. Specifically, we first identify sensitive content via semantic-aware detection. Then, we inject global safety constraints through a secure prompt prefix. Next, we perform trajectory-aware suppression to dynamically block sensitive content across the entire reasoning chain. Finally, we apply token-level adaptive filtering to block both exact and paraphrased sensitive tokens during generation. Furthermore, to overcome the inadequacies of existing evaluation protocols, we introduce two metrics: Multi-Decoding Consistency Assessment (MCS), which measures the consistency of unlearning across diverse decoding strategies, and Multi-Granularity Membership Inference Attack (MIA) Evaluation, which quantifies privacy protection at both the answer and reasoning-chain levels. Experiments on the R-TOFU benchmark demonstrate that STaR achieves comprehensive and stable unlearning with minimal utility loss, setting a new standard for privacy-preserving reasoning in LRMs.
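The four stages described in the abstract can be sketched as a single inference-time wrapper. This is a minimal toy reading of the pipeline, not the paper's implementation: `star_filter`, `generate`, and the keyword-based "semantic" detection are all our assumptions, chosen to make the control flow concrete.

```python
import re

def star_filter(prompt, forget_terms, generate):
    """Toy sketch of STaR's four inference-time stages (names hypothetical).

    generate(prompt, banned) is assumed to be a decoding routine that
    refuses to emit spans matching any pattern in `banned`.
    """
    # 1. Semantic-aware detection: find sensitive surface forms in the query.
    #    A plain keyword match stands in for real semantic matching here.
    detected = [t for t in forget_terms if t.lower() in prompt.lower()]

    # 2. Global safety constraint injected as a secure prompt prefix.
    prefix = ("You must not reveal information about: "
              + ", ".join(detected) + "\n") if detected else ""

    # 3. Trajectory-aware suppression: ban the terms across the whole
    #    reasoning chain, not only in the final answer.
    banned = [re.compile(re.escape(t), re.IGNORECASE) for t in forget_terms]

    # 4. Token-level adaptive filtering happens inside generation.
    return generate(prefix + prompt, banned)
```

A real implementation would operate on the model's token logits rather than on regex patterns, which is what lets it also catch paraphrases.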


Key Contributions

  • STaR: a parameter-free, inference-time unlearning framework combining semantic-aware detection, secure prompt prefix injection, trajectory-aware suppression, and token-level adaptive filtering to block sensitive content throughout CoT reasoning chains
  • Multi-Decoding Consistency Assessment (MCS) metric measuring unlearning robustness across diverse decoding strategies (e.g., ZeroThink, LessThink)
  • Multi-Granularity MIA Evaluation quantifying privacy protection at both final-answer and intermediate reasoning-chain levels on the R-TOFU benchmark
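To make the MCS contribution concrete: a consistency score of this kind can be computed by re-running the unlearned model under several decoding strategies and checking whether any forgotten term resurfaces. The function below is our simplified reading of the idea, not the paper's exact formula; `generate` and the strategy names are placeholders.

```python
def multi_decoding_consistency(generate, prompt, forget_terms, strategies):
    """Toy multi-decoding consistency score: the fraction of decoding
    strategies under which no forgotten term reappears anywhere in the
    output (reasoning chain or answer). 1.0 = unlearning holds under
    every strategy tried."""
    safe = 0
    for strategy in strategies:
        text = generate(prompt, strategy).lower()
        if not any(term.lower() in text for term in forget_terms):
            safe += 1
    return safe / len(strategies)
```

The point of the metric is that answer-only unlearning often looks perfect under greedy decoding but leaks under alternative strategies such as ZeroThink or LessThink, which alter or skip the reasoning phase.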

🛡️ Threat Analysis

Membership Inference Attack

The paper explicitly introduces Multi-Granularity Membership Inference Attack (MIA) Evaluation as a core metric, directly measuring resistance to MIA at both the answer and reasoning-chain levels, so the unlearning defense is validated against a concrete membership inference threat model.
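One way to realize such a multi-granularity evaluation is to run a standard score-based MIA (e.g. scoring by negative loss) separately on the reasoning chain and the final answer, then report an AUC for each. The sketch below is a generic illustration of that setup under our own assumptions, not the paper's protocol; `score_fn` and the record layout are hypothetical.

```python
def auc(member_scores, nonmember_scores):
    """Rank-based AUC: probability a member outscores a non-member."""
    wins = ties = 0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1
            elif m == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(member_scores) * len(nonmember_scores))

def multi_granularity_mia(score_fn, records):
    """records: list of (reasoning_text, answer_text, is_member) triples.
    score_fn maps a text span to an attack score (e.g. negative loss under
    the unlearned model). Returns the attack AUC at each granularity;
    values near 0.5 mean the attacker cannot distinguish members, i.e.
    the unlearning protects that level of the trajectory."""
    results = {}
    for granularity, idx in (("reasoning", 0), ("answer", 1)):
        member = [score_fn(r[idx]) for r in records if r[2]]
        nonmember = [score_fn(r[idx]) for r in records if not r[2]]
        results[granularity] = auc(member, nonmember)
    return results
```

The failure mode this exposes is exactly the paper's motivation: a method can drive the answer-level AUC to chance while the reasoning-chain AUC stays high, meaning membership still leaks through the CoT.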


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Datasets
R-TOFU
Applications
large reasoning model privacy, chain-of-thought privacy, machine unlearning