The Conundrum of Trustworthy Research on Attacking Personally Identifiable Information Removal Techniques
Sebastian Ochs 1,2, Ivan Habernal 1,3,4
1 Trustworthy Human Language Technologies
2 Technical University of Darmstadt
Published on arXiv (2603.08207)
- Model Inversion Attack (OWASP ML Top 10 — ML03)
- Sensitive Information Disclosure (OWASP LLM Top 10 — LLM06)
Key Finding
Reported success rates of LLM-based PII reconstruction attacks are likely overestimated because evaluations fail to control for LLM pretraining memorization and data leakage from publicly available source corpora.
Removing personally identifiable information (PII) from text is necessary to comply with various data protection regulations and to enable data sharing without compromising privacy. However, recent work shows that documents sanitized by PII removal techniques are vulnerable to reconstruction attacks. Yet, we suspect that the reported success of these attacks is largely overestimated. We critically analyze the evaluation of existing attacks and find that data leakage and data contamination are not properly mitigated, leaving unaddressed the question of whether PII removal techniques truly protect privacy in real-world scenarios. We investigate possible data sources and attack setups that avoid data leakage, and conclude that only truly private data allows an objective evaluation of vulnerabilities in PII removal techniques. However, access to private data is heavily restricted, and for good reason, which also means that the public research community cannot address this problem in a transparent, reproducible, and trustworthy manner.
Key Contributions
- Identifies fundamental methodological flaws in existing PII reconstruction attack evaluations, specifically data leakage and data contamination from LLM pretraining memorization
- Argues that only truly private data (unavailable to LLM pretraining) can enable objective evaluation of PII removal vulnerabilities
- Presents experiments on low-contamination data (Czech court announcements and YouTube vlogs) to validate the critique
🛡️ Threat Analysis
The paper's central topic is PII reconstruction attacks — adversaries using LLMs to reverse-engineer private information from sanitized documents. The paper specifically identifies LLM memorization of original private training data as a key confound inflating attack success, which is a form of training data extraction/reconstruction attack.
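The confound can be illustrated with a simple control: run the same attack twice, once with the sanitized document as context and once with no document at all, and credit only the PII that the LLM could not produce from memorization alone. The sketch below is a minimal illustration of this idea, not the paper's actual evaluation protocol; all function and variable names are hypothetical.

```python
def adjusted_attack_success(recovered_with_doc, recovered_no_doc, true_pii):
    """Separate genuine reconstruction from pretraining memorization.

    recovered_with_doc: PII the attack LLM produced given the sanitized document.
    recovered_no_doc:   PII the same LLM produced with NO document (memorization baseline).
    true_pii:           ground-truth PII removed from the original document.

    Returns (raw_rate, adjusted_rate): the naive success rate, and the rate
    after discounting items the model already "knew" without the document.
    """
    true = set(true_pii)
    hits = set(recovered_with_doc) & true       # everything the attack recovered
    memorized = set(recovered_no_doc) & true    # recoverable from memorization alone
    raw = len(hits) / len(true)
    adjusted = len(hits - memorized) / len(true)
    return raw, adjusted


# Hypothetical example: the attack recovers 2 of 3 PII items, but one of
# them is also produced without the document, so only 1 counts as a
# genuine reconstruction from the sanitized text.
raw, adjusted = adjusted_attack_success(
    recovered_with_doc=["Alice Kovar", "Brno"],
    recovered_no_doc=["Alice Kovar"],
    true_pii=["Alice Kovar", "1987-04-12", "Brno"],
)
```

A reported success rate corresponds to `raw` here; the paper's argument is that without the no-document baseline (or, more fundamentally, without data the LLM could not have seen in pretraining), `raw` and `adjusted` cannot be told apart.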