Chain-of-Sanitized-Thoughts: Plugging PII Leakage in CoT of Large Reasoning Models
Arghyadeep Das, Sai Sreenivas Chintha, Rishiraj Girmal, Kinjal Pandey, Sharvi Endait
Published on arXiv: 2601.05076
Sensitive Information Disclosure
OWASP LLM Top 10: LLM06
Key Finding
Both prompt engineering and supervised fine-tuning (SFT) substantially reduce PII exposure in CoT reasoning with minimal utility degradation: state-of-the-art models need only prompting, while weaker models require fine-tuning.
Chain-of-Sanitized-Thoughts
Novel technique introduced
Large Reasoning Models (LRMs) improve performance, reliability, and interpretability by generating explicit chain-of-thought (CoT) reasoning, but this transparency introduces a serious privacy risk: intermediate reasoning often leaks personally identifiable information (PII) even when final answers are sanitized. We study how to induce privacy-first reasoning, where models reason without exposing sensitive information, using deployable interventions rather than post-hoc redaction. We introduce PII-CoT-Bench, a supervised dataset with privacy-aware CoT annotations, and a category-balanced evaluation benchmark covering realistic and adversarial leakage scenarios. Our results reveal a capability-dependent trend: state-of-the-art models benefit most from prompt-based controls, whereas weaker models require fine-tuning to achieve meaningful leakage reduction. Across models and categories, both approaches substantially reduce PII exposure with minimal degradation in utility, demonstrating that private CoT reasoning can be achieved without sacrificing performance. Overall, our results provide practical guidance for building privacy-preserving reasoning systems.
Key Contributions
- PII-CoT-Bench: the first benchmark of CoT prompts with synthetic PII paired with privacy-aware target reasoning traces, covering realistic and adversarial leakage scenarios
- Chain-of-Sanitized-Thoughts framework: teaches LRMs to 'think privately' via prompt engineering or supervised fine-tuning rather than post-hoc redaction
- Empirical finding that SOTA models respond to prompt-based controls while weaker models require SFT, with both approaches achieving meaningful PII reduction at minimal utility cost
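To make the prompt-engineering variant concrete, here is a minimal sketch, not from the paper itself: a hypothetical privacy-first system prompt of the kind the framework describes, plus a toy regex-based leakage check over a reasoning trace. The prompt text, the `count_pii_leaks` helper, and the two PII categories are illustrative assumptions; the paper's benchmark covers a broader, category-balanced PII taxonomy.

```python
import re

# Hypothetical system prompt illustrating the prompt-engineering variant of
# Chain-of-Sanitized-Thoughts: the model is instructed to sanitize PII inside
# its reasoning, not just in the final answer. (Illustrative wording, not the
# paper's actual prompt.)
SANITIZED_COT_PROMPT = (
    "When reasoning step by step, never write out personally identifiable "
    "information (names, emails, phone numbers, addresses, SSNs). Refer to "
    "such values with abstract placeholders like [NAME] or [EMAIL] instead."
)

# Toy regex detectors for two common PII categories; a real evaluation would
# use a much broader detector suite.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def count_pii_leaks(cot_trace: str) -> dict:
    """Count PII matches per category in a chain-of-thought trace."""
    return {name: len(pat.findall(cot_trace)) for name, pat in PII_PATTERNS.items()}

# A raw trace leaks PII mid-reasoning; a sanitized trace uses placeholders.
raw_trace = "The user john.doe@example.com called from 555-123-4567, so..."
sanitized_trace = "The user [EMAIL] called from [PHONE], so..."

print(count_pii_leaks(raw_trace))        # {'email': 1, 'phone': 1}
print(count_pii_leaks(sanitized_trace))  # {'email': 0, 'phone': 0}
```

In practice, the paper's approach replaces the post-hoc check above with interventions at generation time (the system prompt for capable models, SFT on privacy-aware traces for weaker ones), so the reasoning itself never contains the raw values.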