Chain-of-Sanitized-Thoughts: Plugging PII Leakage in CoT of Large Reasoning Models
Arghyadeep Das, Sai Sreenivas Chintha, Rishiraj Girmal, Kinjal Pandey, Sharvi Endait
Published on arXiv: 2601.05076
Sensitive Information Disclosure
OWASP LLM Top 10: LLM06
Key Finding
Both prompt engineering and supervised fine-tuning (SFT) substantially reduce PII exposure in CoT reasoning with minimal utility degradation: state-of-the-art models need only prompting, while weaker models require fine-tuning.
Chain-of-Sanitized-Thoughts
Novel technique introduced
Large Reasoning Models (LRMs) improve performance, reliability, and interpretability by generating explicit chain-of-thought (CoT) reasoning, but this transparency introduces a serious privacy risk: intermediate reasoning often leaks personally identifiable information (PII) even when final answers are sanitized. We study how to induce privacy-first reasoning, where models reason without exposing sensitive information, using deployable interventions rather than post-hoc redaction. We introduce PII-CoT-Bench, a supervised dataset with privacy-aware CoT annotations, and a category-balanced evaluation benchmark covering realistic and adversarial leakage scenarios. Our results reveal a capability-dependent trend: state-of-the-art models benefit most from prompt-based controls, whereas weaker models require fine-tuning to achieve meaningful leakage reduction. Across models and categories, both approaches substantially reduce PII exposure with minimal degradation in utility, demonstrating that private CoT reasoning can be achieved without sacrificing performance. Overall, our results provide practical guidance for building privacy-preserving reasoning systems.
Key Contributions
- PII-CoT-Bench: the first benchmark of CoT prompts with synthetic PII paired with privacy-aware target reasoning traces, covering realistic and adversarial leakage scenarios
- Chain-of-Sanitized-Thoughts framework: teaches LRMs to 'think privately' via prompt engineering or supervised fine-tuning rather than post-hoc redaction
- Empirical finding that SOTA models respond to prompt-based controls while weaker models require SFT, with both approaches achieving meaningful PII reduction at minimal utility cost
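To make the prompt-engineering variant concrete, here is a minimal sketch, not from the paper itself: a hypothetical privacy-first system prompt of the kind the framework describes, plus a toy regex-based leakage check over a reasoning trace. The prompt text, the `count_pii_leaks` helper, and the two PII categories are illustrative assumptions; the paper's benchmark covers a broader, category-balanced PII taxonomy.

```python
import re

# Hypothetical system prompt illustrating the prompt-engineering variant of
# Chain-of-Sanitized-Thoughts: the model is instructed to sanitize PII inside
# its reasoning, not just in the final answer. (Illustrative wording, not the
# paper's actual prompt.)
SANITIZED_COT_PROMPT = (
    "When reasoning step by step, never write out personally identifiable "
    "information (names, emails, phone numbers, addresses, SSNs). Refer to "
    "such values with abstract placeholders like [NAME] or [EMAIL] instead."
)

# Toy regex detectors for two common PII categories; a real evaluation would
# use a much broader detector suite.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def count_pii_leaks(cot_trace: str) -> dict:
    """Count PII matches per category in a chain-of-thought trace."""
    return {name: len(pat.findall(cot_trace)) for name, pat in PII_PATTERNS.items()}

# A raw trace leaks PII mid-reasoning; a sanitized trace uses placeholders.
raw_trace = "The user john.doe@example.com called from 555-123-4567, so..."
sanitized_trace = "The user [EMAIL] called from [PHONE], so..."

print(count_pii_leaks(raw_trace))        # {'email': 1, 'phone': 1}
print(count_pii_leaks(sanitized_trace))  # {'email': 0, 'phone': 0}
```

In practice, the paper's approach replaces the post-hoc check above with interventions at generation time (the system prompt for capable models, SFT on privacy-aware traces for weaker ones), so the reasoning itself never contains the raw values.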