defense 2026

CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization

Junyi Li 1, Yongqiang Chen 2, Ningning Ding 1

0 citations

Published on arXiv

2604.15847

Membership Inference Attack

OWASP ML Top 10 — ML04

Key Finding

Achieves state-of-the-art unlearning efficacy while uniquely preserving fundamental reasoning capabilities, mitigating the forgetting-utility trade-off

CiPO

Novel technique introduced


Machine unlearning has gained increasing attention in recent years as a promising technique for selectively removing unwanted private or copyrighted information from Large Language Models trained on massive amounts of human data. However, the emergence of Large Reasoning Models (LRMs), which rely on long chain-of-thought (CoT) reasoning to address complex questions, poses a dilemma for unlearning: existing methods either fail to completely eliminate undesired knowledge from the CoT traces or degrade reasoning performance by interfering with the reasoning process. To this end, we introduce Counterfactual Unlearning through iterative Preference Optimization (CiPO), a novel framework that redefines unlearning as a targeted intervention on the CoT reasoning of LRMs. More specifically, given a desired unlearning target answer, CiPO instructs the LRM to generate a logically valid counterfactual reasoning trace for preference tuning. As the LRM adjusts to the counterfactual trace, CiPO iteratively updates the preference learning data to increase the discrepancy from the original model. This iterative loop ensures both effective unlearning and smooth optimization, mitigating the dilemma. Experiments on challenging benchmarks demonstrate that CiPO excels at unlearning, completely removing knowledge from both the intermediate CoT steps and the final answer while preserving the reasoning abilities of LRMs.
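The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the preference objective, so a DPO-style loss is assumed here, and the choice of the current model's own trace as the rejected sample is also an assumption. The names `PreferencePair`, `preference_loss`, `iterative_unlearning`, `generate_trace`, `generate_counterfactual`, and `tune` are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # counterfactual trace ending in the unlearning target answer
    rejected: str  # assumption: the current model's own trace for the prompt


def preference_loss(pi_chosen: float, pi_rejected: float,
                    ref_chosen: float, ref_rejected: float,
                    beta: float = 0.1) -> float:
    """DPO-style loss on sequence log-probabilities (an assumed objective)."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    x = -margin
    # -log(sigmoid(margin)) == softplus(-margin), computed in a stable form.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def iterative_unlearning(
    prompts: Sequence[str],
    targets: Sequence[str],
    generate_trace: Callable[[str], str],
    generate_counterfactual: Callable[[str, str], str],
    tune: Callable[[List[PreferencePair]], None],
    rounds: int = 3,
) -> List[List[PreferencePair]]:
    """Rebuild the preference data from the current model in every round."""
    history = []
    for _ in range(rounds):
        pairs = [
            PreferencePair(p, generate_counterfactual(p, t), generate_trace(p))
            for p, t in zip(prompts, targets)
        ]
        tune(pairs)            # one preference-tuning pass on the fresh pairs
        history.append(pairs)  # kept only so the rounds can be inspected
    return history
```

In practice `generate_trace`, `generate_counterfactual`, and `tune` would wrap an actual LRM and a preference-tuning step. They are left as plain callables here so the control flow stays visible.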


Key Contributions

  • Reframes LRM unlearning as targeted intervention on chain-of-thought reasoning traces
  • Iterative preference optimization framework that generates counterfactual reasoning traces
  • Achieves complete knowledge removal from both CoT steps and final answers while preserving reasoning abilities

🛡️ Threat Analysis

Membership Inference Attack

The paper explicitly evaluates unlearning effectiveness against membership inference attacks (the introduction notes: 'coarse approach...increases exposure to membership inference attacks'). It positions the unlearning method as a defense against privacy attacks in which adversaries infer whether specific data was part of the training set.
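For readers unfamiliar with the attack class, a simple loss-threshold membership inference test can be sketched as below. This is a generic textbook baseline, not the evaluation protocol used in the paper. The function names and the example loss values are hypothetical.

```python
from typing import List, Sequence


def predict_members(losses: Sequence[float], threshold: float) -> List[bool]:
    """Flag a sample as a training member when the model's loss on it is low."""
    return [loss < threshold for loss in losses]


def attack_advantage(member_losses: Sequence[float],
                     nonmember_losses: Sequence[float],
                     threshold: float) -> float:
    """True-positive rate minus false-positive rate of the threshold attack."""
    tpr = sum(predict_members(member_losses, threshold)) / len(member_losses)
    fpr = sum(predict_members(nonmember_losses, threshold)) / len(nonmember_losses)
    return tpr - fpr
```

An advantage near zero on the forget set would suggest the attacker can no longer separate forgotten samples from unseen ones. A value near one would indicate that membership is still easy to infer.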


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time
Applications
question answering, reasoning systems