defense 2025

The Double-edged Sword of LLM-based Data Reconstruction: Understanding and Mitigating Contextual Vulnerability in Word-level Differential Privacy Text Sanitization

Stephen Meisenbacher, Alexandra Klymenko, Andreea-Elena Bodea, Florian Matthes



Published on arXiv: 2508.18976

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

LLMs can exploit contextual vulnerability in word-level DP text sanitization to infer original text semantics, but this same capability can be repurposed as adversarial post-processing to strengthen privacy protections.


Differentially private text sanitization refers to the process of privatizing texts under the framework of Differential Privacy (DP), providing provable privacy guarantees while also empirically defending against adversaries seeking to harm privacy. Despite their simplicity, DP text sanitization methods operating at the word level exhibit a number of shortcomings, among them the tendency to leave contextual clues from the original texts due to randomization during sanitization – this we refer to as contextual vulnerability. Given the powerful contextual understanding and inference capabilities of Large Language Models (LLMs), we explore to what extent LLMs can be leveraged to exploit the contextual vulnerability of DP-sanitized texts. We expand on previous work not only in the use of advanced LLMs, but also in testing a broader range of sanitization mechanisms at various privacy levels. Our experiments uncover a double-edged sword effect of LLM-based data reconstruction attacks on privacy and utility: while LLMs can indeed infer original semantics and sometimes degrade empirical privacy protections, they can also be used for good, to improve the quality and privacy of DP-sanitized texts. Based on our findings, we propose recommendations for using LLM data reconstruction as a post-processing step, serving to increase privacy protection by thinking adversarially.
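To make the setting concrete, below is a minimal sketch of one common word-level DP sanitization scheme (metric-DP perturbation of word embeddings, in the style of mechanisms the paper evaluates). The toy 2-D embeddings and vocabulary are invented for illustration; real mechanisms use pretrained vectors and much larger vocabularies. It is this word-by-word, context-ignoring replacement that leaves the contextual clues the paper terms contextual vulnerability.

```python
# Illustrative sketch (not the paper's exact mechanism): word-level metric-DP
# sanitization. Each word's embedding is perturbed with noise calibrated to
# epsilon, then snapped back to the nearest vocabulary word. The toy 2-D
# embedding table below is an invented assumption for the demo.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy embeddings (real mechanisms use e.g. GloVe vectors).
VOCAB = {
    "doctor":  np.array([1.0, 0.2]),
    "nurse":   np.array([0.9, 0.3]),
    "lawyer":  np.array([0.1, 1.0]),
    "teacher": np.array([0.2, 0.9]),
}

def sample_multivariate_laplace(epsilon: float, dim: int) -> np.ndarray:
    """Sample noise with density proportional to exp(-epsilon * ||z||)."""
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    radius = rng.gamma(shape=dim, scale=1.0 / epsilon)
    return radius * direction

def sanitize_word(word: str, epsilon: float) -> str:
    """Perturb the word's embedding, then return the nearest vocabulary word."""
    noisy = VOCAB[word] + sample_multivariate_laplace(epsilon, dim=2)
    return min(VOCAB, key=lambda w: np.linalg.norm(VOCAB[w] - noisy))

def sanitize(text: str, epsilon: float) -> str:
    """Sanitize a text word by word, ignoring context entirely."""
    return " ".join(sanitize_word(w, epsilon) for w in text.split())

# Low epsilon: substitutions are frequent; high epsilon: words often survive.
print(sanitize("doctor teacher", epsilon=0.5))
print(sanitize("doctor teacher", epsilon=100.0))
```

Because each word is replaced independently, surviving neighbors (e.g. an untouched "teacher" next to a perturbed "doctor") act as contextual clues that an LLM adversary can exploit.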


Key Contributions

  • Identifies and formalizes 'contextual vulnerability' in word-level DP text sanitization, whereby randomization leaves contextual clues exploitable by LLM adversaries
  • Empirically demonstrates that advanced LLMs can reconstruct original semantics from DP-sanitized text, sometimes degrading empirical privacy protections across multiple sanitization mechanisms and privacy budget levels
  • Proposes leveraging LLM-based reconstruction adversarially as a post-processing step to improve both privacy and utility of DP-sanitized texts

🛡️ Threat Analysis

Model Inversion Attack

The paper's core contribution is a study of data reconstruction attacks: an LLM adversary exploits contextual clues left in the outputs of differentially private text sanitization to infer (reconstruct) the original private text. The paper evaluates this threat empirically across multiple DP mechanisms and privacy budgets and proposes mitigations. This fits ML03's pattern of an adversary reconstructing private data, even though the 'model' being attacked is a DP sanitization mechanism rather than a traditional ML model.
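The attack is black-box and inference-time: the adversary only needs the sanitized text and a capable LLM. A hedged sketch of the setup follows; the prompt wording is an assumption for illustration, not the paper's exact prompt, and the call to an actual LLM API is omitted.

```python
# Sketch of the reconstruction setup (adversarial use) which the paper also
# repurposes defensively as a post-processing step. The prompt text below is
# a hypothetical example, not the paper's exact wording.

def build_reconstruction_prompt(sanitized_text: str) -> str:
    """Assemble a black-box, inference-time reconstruction prompt for an LLM."""
    return (
        "The following text was privatized by a word-level differential-"
        "privacy mechanism: some words were randomly replaced, but "
        "contextual clues may remain.\n\n"
        f"Sanitized text: {sanitized_text}\n\n"
        "Reconstruct the most plausible original text."
    )

prompt = build_reconstruction_prompt("the nurse visited the courtroom yesterday")
# The prompt would then be sent to any chat-completion LLM. Used defensively,
# the model's reconstruction replaces the sanitized text as a post-processing
# step, smoothing out residual clues before release.
print(prompt)
```

Note that DP's post-processing property means running such a rewrite on already-sanitized text cannot weaken the formal guarantee, which is what makes the defensive repurposing sound.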


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Applications
differentially private text sanitization, nlp privacy, text anonymization