Defense · 2025

Sanitize Your Responses: Mitigating Privacy Leakage in Large Language Models

Wenjie Fu 1, Huandong Wang 2, Junyao Gao 3, Guoan Wan 4, Tao Jiang 1



Published on arXiv · 2509.24488

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Self-Sanitize achieves superior privacy leakage mitigation across four LLMs with negligible latency impact, outperforming post-hoc filtering approaches while preserving streaming generation compatibility.

Self-Sanitize

Novel technique introduced


As Large Language Models (LLMs) achieve remarkable success across a wide range of applications, such as chatbots and code copilots, concerns surrounding the generation of harmful content have come increasingly into focus. Despite significant advances in aligning LLMs with safety and ethical standards, adversarial prompts can still be crafted to elicit undesirable responses. Existing mitigation strategies are predominantly based on post-hoc filtering, which introduces substantial latency or computational overhead and is incompatible with token-level streaming generation. In this work, we introduce Self-Sanitize, a novel LLM-driven mitigation framework inspired by cognitive psychology, which emulates human self-monitoring and self-repair behaviors during conversations. Self-Sanitize comprises a lightweight Self-Monitor module that continuously inspects high-level intentions within the LLM at the token level via representation engineering, and a Self-Repair module that performs in-place correction of harmful content without initiating separate review dialogues. This design allows for real-time streaming monitoring and seamless repair, with negligible impact on latency and resource utilization. Given that privacy-invasive content has often received insufficient attention in previous studies, we perform extensive experiments on four LLMs across three privacy leakage scenarios. The results demonstrate that Self-Sanitize achieves superior mitigation performance with minimal overhead and without degrading the utility of LLMs, offering a practical and robust solution for safer LLM deployments. Our code is available at the following link: https://github.com/wjfu99/LLM_Self_Sanitize


Key Contributions

  • Self-Monitor module using representation engineering to detect harmful intent at token level without separate review dialogue
  • Self-Repair module that corrects harmful content in-place, enabling real-time streaming generation with negligible latency overhead
  • Empirical evaluation across four LLMs and three privacy leakage scenarios demonstrating superior mitigation with minimal utility degradation
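The monitor-then-repair loop described above can be sketched in a few lines. The following is a hypothetical illustration, not the authors' implementation: it assumes a trained linear probe over per-token hidden states (here initialized randomly for demonstration), a fixed score threshold, and a placeholder `[REDACTED]` repair. In the real Self-Sanitize framework, the probe comes from representation engineering on the LLM's internal activations and the repair is generated by the model itself.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN_DIM = 16

# Hypothetical probe parameters; in practice these would be trained on
# labeled activations extracted from the LLM (representation engineering).
probe_w = rng.normal(size=HIDDEN_DIM)
probe_b = 0.0


def monitor(hidden_state: np.ndarray, threshold: float = 0.5) -> bool:
    """Self-Monitor sketch: score one token's hidden state with a linear
    probe and flag it if the sigmoid score exceeds the threshold."""
    score = 1.0 / (1.0 + np.exp(-(hidden_state @ probe_w + probe_b)))
    return score > threshold


def repair(token: str) -> str:
    """Self-Repair sketch: replace the flagged token in place. The paper's
    method regenerates content; a placeholder keeps the example simple."""
    return "[REDACTED]"


def stream_sanitize(tokens, hidden_states):
    """Emit tokens one at a time, repairing flagged ones as they stream,
    so no separate post-hoc review pass over the full response is needed."""
    for tok, h in zip(tokens, hidden_states):
        yield repair(tok) if monitor(h) else tok
```

Because each token is checked against its own representation as it is produced, the filter adds only one dot product per token to the generation loop, which is why this style of monitoring is compatible with streaming and adds negligible latency.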

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Applications
chatbots, code copilots, llm deployments