CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization
Junyi Li 1, Yongqiang Chen 2, Ningning Ding 1
Published on arXiv: 2604.15847
Membership Inference Attack
OWASP ML Top 10 — ML04
Key Finding
Achieves state-of-the-art unlearning efficacy while uniquely preserving fundamental reasoning capabilities, mitigating the forgetting-utility trade-off
Novel technique introduced
CiPO
Machine unlearning has gained increasing attention in recent years as a promising technique for selectively removing unwanted private or copyrighted information from Large Language Models, which are trained on massive amounts of human data. However, the emergence of Large Reasoning Models (LRMs), which rely on long chain-of-thought (CoT) reasoning to address complex questions, poses a dilemma for unlearning: existing methods either fail to completely eliminate undesired knowledge from the CoT traces or degrade reasoning performance by interfering with the reasoning process. To this end, we introduce Counterfactual Unlearning through iterative Preference Optimization (CiPO), a novel framework that redefines unlearning as a targeted intervention on the CoT reasoning of LRMs. More specifically, given a desired unlearning target answer, CiPO instructs the LRM to generate a logically valid counterfactual reasoning trace for preference tuning. As the LRM adjusts to the counterfactual trace, CiPO iteratively updates the preference learning data to increase the discrepancy from the original model. This iterative loop ensures both effective unlearning and smooth optimization, mitigating the dilemma. Experiments on challenging benchmarks demonstrate that CiPO excels at unlearning, completely removing knowledge from both the intermediate CoT steps and the final answer, while preserving the reasoning abilities of LRMs.
Key Contributions
- Reframes LRM unlearning as targeted intervention on chain-of-thought reasoning traces
- Introduces an iterative preference optimization framework that generates counterfactual reasoning traces
- Achieves complete knowledge removal from both CoT steps and final answers while preserving reasoning abilities
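The iterative loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `PreferencePair`, `dpo_style_loss`, and `run_rounds` are hypothetical, the three callables stand in for model operations the summary does not specify, and a DPO-style objective is assumed only because the summary says "preference optimization" without naming a loss. Using the current model's own trace as the rejected sample is likewise an assumption.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # counterfactual trace ending in the unlearning target answer
    rejected: str  # assumption: the current model's own trace for the same prompt


def dpo_style_loss(logp_chosen: float, logp_rejected: float,
                   ref_logp_chosen: float, ref_logp_rejected: float,
                   beta: float = 0.1) -> float:
    """-log(sigmoid(margin)) for one pair, computed in a numerically stable way."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def run_rounds(prompts: List[str],
               target_answers: List[str],
               generate_trace: Callable[[str], str],
               generate_counterfactual: Callable[[str, str], str],
               preference_update: Callable[[List[PreferencePair]], None],
               rounds: int = 3) -> List[List[PreferencePair]]:
    """Rebuild the preference data from the current model in every round."""
    history = []
    for _ in range(rounds):
        pairs = [
            PreferencePair(p, generate_counterfactual(p, a), generate_trace(p))
            for p, a in zip(prompts, target_answers)
        ]
        preference_update(pairs)  # hypothetical: one preference-tuning pass
        history.append(pairs)
    return history
```

Regenerating the pairs inside the loop, rather than once up front, is what makes the procedure iterative: each round's preference data reflects the model as it stands after the previous update.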
🛡️ Threat Analysis
The paper explicitly evaluates unlearning effectiveness against membership inference attacks (its introduction notes that a 'coarse approach...increases exposure to membership inference attacks'). The unlearning method is positioned as a defense against privacy attacks in which adversaries try to infer whether specific data was part of the training set.
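To make the threat concrete, here is a minimal sketch of one common form of membership inference, a loss-threshold attack. It is illustrative only and is not claimed to be the attack the paper evaluates; the function names and the threshold value are hypothetical.

```python
from typing import List


def loss_threshold_mia(losses: List[float], threshold: float) -> List[bool]:
    """Guess 'member' when the model's loss on an example falls below the threshold."""
    return [loss < threshold for loss in losses]


def attack_accuracy(guesses: List[bool], is_member: List[bool]) -> float:
    """Fraction of examples whose membership the attacker guessed correctly."""
    correct = sum(g == m for g, m in zip(guesses, is_member))
    return correct / len(is_member)


# Toy numbers: low loss on the first two examples suggests they were trained on.
guesses = loss_threshold_mia([0.1, 0.2, 2.0, 3.0], threshold=1.0)
accuracy = attack_accuracy(guesses, [True, True, False, False])
```

Under this view, unlearning succeeds as a defense when the model's losses on forgotten examples become indistinguishable from its losses on data it never saw, which pushes the attacker's accuracy toward chance.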