
PATCH: Mitigating PII Leakage in Language Models with Privacy-Aware Targeted Circuit PatcHing

Anthony Hughes 1, Vasisht Duddu 2, N. Asokan 2, Nikolaos Aletras 1, Ning Ma 1

0 citations · 47 references · arXiv


Published on arXiv: 2510.07452

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

PATCH reduces PII recall by up to 65% with better utility than DP, and combined with DP reduces residual leakage to as low as 0.01%

PATCH (Privacy-Aware Targeted Circuit PatcHing)

Novel technique introduced


Language models (LMs) may memorize personally identifiable information (PII) from training data, enabling adversaries to extract it during inference. Existing defense mechanisms such as differential privacy (DP) reduce this leakage, but incur large drops in utility. Through a comprehensive study using circuit discovery to identify the computational circuits responsible for PII leakage in LMs, we test the hypothesis that specific circuits drive this behavior. Building on this, we propose PATCH (Privacy-Aware Targeted Circuit PatcHing), a novel approach that first identifies and then directly edits PII leakage circuits to reduce leakage. PATCH achieves a better privacy-utility trade-off than existing defenses, e.g., reducing recall of leaked PII by up to 65%. Finally, PATCH can be combined with DP to reduce recall of residual leakage to as low as 0.01%. Our analysis shows that PII leakage circuits persist even after existing defense mechanisms are applied, whereas PATCH can effectively mitigate their impact.


Key Contributions

  • Mechanistic circuit discovery study identifying the specific internal computational circuits responsible for PII leakage in language models
  • PATCH: a targeted model-editing method that directly suppresses PII leakage circuits, achieving better privacy-utility trade-off than differential privacy baselines
  • Demonstration that PII leakage circuits persist after existing defenses (e.g., DP), and that PATCH can be combined with DP to reduce residual PII recall to 0.01%
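The two-stage recipe in the contributions above (discover the leakage circuit, then edit it) can be sketched on a toy additive model whose components stand in for attention heads and MLP blocks. Everything below is illustrative: the fixed weights, the greedy search, and the proxy leakage metric are assumptions for exposition, not the paper's actual circuit-discovery algorithm or PII evaluation.

```python
# Toy model: output is a sum of per-component contributions.
# Components 2 and 5 play the role of a memorization-heavy
# "PII leakage circuit" with outsized contributions.
WEIGHTS = [0.10, 0.05, 2.00, 0.08, 0.12, 1.50, 0.07, 0.09]
TOTAL = sum(abs(w) for w in WEIGHTS)  # leakage of the unedited model


def pii_recall(weights):
    """Proxy leakage metric, normalized so the unedited model scores 1.0.
    In the paper this would be measured by prompting the LM and scoring
    how much memorized PII it reproduces."""
    return sum(abs(w) for w in weights) / TOTAL


def ablate(weights, indices, scale=0.0):
    """Scale the given components toward zero; scale=0.0 is a hard ablation."""
    return [w * scale if i in indices else w for i, w in enumerate(weights)]


def discover_circuit(weights, threshold=0.5):
    """Stage 1 (circuit discovery): greedily ablate the component whose
    removal most reduces the leakage metric, until leakage falls below
    the threshold. The ablated set approximates the leakage circuit."""
    circuit = set()
    while pii_recall(ablate(weights, circuit)) > threshold:
        best = min(
            (i for i in range(len(weights)) if i not in circuit),
            key=lambda i: pii_recall(ablate(weights, circuit | {i})),
        )
        circuit.add(best)
    return circuit


circuit = discover_circuit(WEIGHTS)
patched = ablate(WEIGHTS, circuit)  # Stage 2: the permanent, targeted edit
print(sorted(circuit))              # [2, 5] -- the discovered circuit
print(round(pii_recall(patched), 3))  # 0.127 -- residual leakage
```

The greedy loop finds the two high-contribution components and the edit leaves the remaining (utility-carrying) components untouched, which is the intuition behind PATCH's better privacy-utility trade-off than DP's model-wide noise.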

🛡️ Threat Analysis

Model Inversion Attack

The paper defends against adversaries extracting memorized PII (training data) from LMs at inference time — a direct model inversion / training data extraction threat. Circuit discovery identifies which internal components drive leakage, and PATCH edits them to prevent reconstruction of private training records.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time
Applications
language model pii protection, training data extraction defense