
PATCH: Mitigating PII Leakage in Language Models with Privacy-Aware Targeted Circuit PatcHing

Anthony Hughes 1, Vasisht Duddu 2, N. Asokan 2, Nikolaos Aletras 1, Ning Ma 1

0 citations · 47 references · arXiv


Published on arXiv: 2510.07452

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

PATCH reduces PII recall by up to 65% with better utility than DP, and combined with DP reduces residual leakage to as low as 0.01%

PATCH (Privacy-Aware Targeted Circuit PatcHing)

Novel technique introduced


Language models (LMs) may memorize personally identifiable information (PII) from training data, enabling adversaries to extract it during inference. Existing defense mechanisms such as differential privacy (DP) reduce this leakage, but incur large drops in utility. Through a comprehensive study using circuit discovery to identify the computational circuits responsible for PII leakage in LMs, we test the hypothesis that specific circuits drive this behavior. Building on this, we propose PATCH (Privacy-Aware Targeted Circuit PatcHing), a novel approach that first identifies and then directly edits PII leakage circuits to reduce leakage. PATCH achieves a better privacy-utility trade-off than existing defenses, e.g., reducing recall of leaked PII by up to 65%. Finally, PATCH can be combined with DP to reduce recall of residual leakage to as low as 0.01%. Our analysis shows that PII leakage circuits persist even after existing defense mechanisms are applied, whereas PATCH can effectively mitigate their impact.


Key Contributions

  • Mechanistic circuit discovery study identifying the specific internal computational circuits responsible for PII leakage in language models
  • PATCH: a targeted model-editing method that directly suppresses PII leakage circuits, achieving better privacy-utility trade-off than differential privacy baselines
  • Demonstration that PII leakage circuits persist after existing defenses (e.g., DP), and that PATCH can be combined with DP to reduce residual PII recall to 0.01%
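The two-stage recipe in the contributions above (discover the leakage circuit, then edit it) can be sketched on a toy additive model whose components stand in for attention heads and MLP blocks. Everything below is illustrative: the fixed weights, the greedy search, and the proxy leakage metric are assumptions for exposition, not the paper's actual circuit-discovery algorithm or PII evaluation.

```python
# Toy model: output is a sum of per-component contributions.
# Components 2 and 5 play the role of a memorization-heavy
# "PII leakage circuit" with outsized contributions.
WEIGHTS = [0.10, 0.05, 2.00, 0.08, 0.12, 1.50, 0.07, 0.09]
TOTAL = sum(abs(w) for w in WEIGHTS)  # leakage of the unedited model


def pii_recall(weights):
    """Proxy leakage metric, normalized so the unedited model scores 1.0.
    In the paper this would be measured by prompting the LM and scoring
    how much memorized PII it reproduces."""
    return sum(abs(w) for w in weights) / TOTAL


def ablate(weights, indices, scale=0.0):
    """Scale the given components toward zero; scale=0.0 is a hard ablation."""
    return [w * scale if i in indices else w for i, w in enumerate(weights)]


def discover_circuit(weights, threshold=0.5):
    """Stage 1 (circuit discovery): greedily ablate the component whose
    removal most reduces the leakage metric, until leakage falls below
    the threshold. The ablated set approximates the leakage circuit."""
    circuit = set()
    while pii_recall(ablate(weights, circuit)) > threshold:
        best = min(
            (i for i in range(len(weights)) if i not in circuit),
            key=lambda i: pii_recall(ablate(weights, circuit | {i})),
        )
        circuit.add(best)
    return circuit


circuit = discover_circuit(WEIGHTS)
patched = ablate(WEIGHTS, circuit)  # Stage 2: the permanent, targeted edit
print(sorted(circuit))              # [2, 5] -- the discovered circuit
print(round(pii_recall(patched), 3))  # 0.127 -- residual leakage
```

The greedy loop finds the two high-contribution components and the edit leaves the remaining (utility-carrying) components untouched, which is the intuition behind PATCH's better privacy-utility trade-off than DP's model-wide noise.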

🛡️ Threat Analysis

Model Inversion Attack

The paper defends against adversaries extracting memorized PII (training data) from LMs at inference time — a direct model inversion / training data extraction threat. Circuit discovery identifies which internal components drive leakage, and PATCH edits them to prevent reconstruction of private training records.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time
Applications
language model pii protection, training data extraction defense