attack 2025

Stronger Re-identification Attacks through Reasoning and Aggregation

Lucas Georges Gabriel Charpentier 1, Pierre Lison 2

0 citations · 26 references · arXiv

Published on arXiv

2510.09184

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Aggregating re-identification predictions across multiple PII orderings and using reasoning models substantially improves attack accuracy over sequential baselines, especially when the adversary has extensive background knowledge.


Text de-identification techniques are often used to mask personally identifiable information (PII) from documents. Their ability to conceal the identity of the individuals mentioned in a text is, however, hard to measure. Recent work has shown how the robustness of de-identification methods could be assessed by attempting the reverse process of _re-identification_, based on an automated adversary using its background knowledge to uncover the PIIs that have been masked. This paper presents two complementary strategies to build stronger re-identification attacks. We first show that (1) the _order_ in which the PII spans are re-identified matters, and that aggregating predictions across multiple orderings leads to improved results. We also find that (2) reasoning models can boost the re-identification performance, especially when the adversary is assumed to have access to extensive background knowledge.
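The aggregation idea in point (1) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how predictions are combined, so majority voting is assumed here, and the names `reidentify_in_order`, `aggregate_over_orderings`, and the `predict` callback are hypothetical stand-ins for whatever model produces a guess for a masked span.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

# Hypothetical predictor: takes the id of the masked span to fill and the
# guesses made so far in this pass, and returns a guess for that span.
Predictor = Callable[[str, Dict[str, str]], str]


def reidentify_in_order(order: Sequence[str], predict: Predictor) -> Dict[str, str]:
    """Fill masked spans one at a time, following the given ordering.

    Each prediction can see the spans already filled earlier in the pass,
    which is why the ordering can change the outcome.
    """
    resolved: Dict[str, str] = {}
    for span_id in order:
        resolved[span_id] = predict(span_id, dict(resolved))
    return resolved


def aggregate_over_orderings(
    predict: Predictor, orderings: Sequence[Sequence[str]]
) -> Dict[str, str]:
    """Run one pass per ordering and keep the most frequent guess per span.

    Majority voting is an assumption made for this sketch.
    """
    votes: Dict[str, Counter] = {}
    for order in orderings:
        for span_id, guess in reidentify_in_order(order, predict).items():
            votes.setdefault(span_id, Counter())[guess] += 1
    return {span_id: counts.most_common(1)[0][0] for span_id, counts in votes.items()}


def toy_predict(span_id: str, resolved: Dict[str, str]) -> str:
    """Toy stand-in for a model: the name is only guessable once the city is known."""
    if span_id == "NAME":
        return "Ada Lovelace" if "CITY" in resolved else "unknown"
    return "London"


if __name__ == "__main__":
    orderings: List[List[str]] = [["CITY", "NAME"], ["NAME", "CITY"], ["CITY", "NAME"]]
    print(aggregate_over_orderings(toy_predict, orderings))
```

In the toy setup, the pass that starts with `NAME` fails on that span, while the two passes that resolve `CITY` first succeed, so voting across orderings recovers the name that a single unlucky ordering would miss.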


Key Contributions

  • Shows that the order in which masked PII spans are re-identified significantly affects accuracy, and that aggregating predictions across multiple orderings substantially improves re-identification
  • Demonstrates that LLMs with reasoning capabilities boost re-identification performance, especially under high background-knowledge adversary assumptions
  • Extends a retrieval-augmented re-identification framework with a custom-trained dense retriever for de-identified document contexts

Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time, targeted
Datasets
Wikipedia biographies
Applications
text de-identification, medical records anonymization, court judgment anonymization