
Beyond Membership: Limitations of Add/Remove Adjacency in Differential Privacy

Gauri Pradhan 1, Joonas Jälkö 1, Santiago Zanella-Béguelin 2, Antti Honkela 1

0 citations · 37 references · arXiv

Published on arXiv: 2511.21804

Model Inversion Attack

OWASP ML Top 10 — ML03

Key Finding

Attribute inference attacks on DP-SGD models exceed the privacy budgets reported under add/remove accounting but remain consistent with substitute-adjacency bounds, showing that standard DP libraries can mislead practitioners about attribute privacy.

Substitute-adjacency canary auditing

Novel technique introduced


Training machine learning models with differential privacy (DP) limits an adversary's ability to infer sensitive information about the training data. DP can be interpreted as a bound on an adversary's capability to distinguish two adjacent datasets under a chosen adjacency relation. In practice, most DP implementations use the add/remove adjacency relation, where two datasets are adjacent if one can be obtained from the other by adding or removing a single record, thereby protecting membership. In many ML applications, however, the goal is to protect attributes of individual records (e.g., labels used in supervised fine-tuning). We show that privacy accounting under add/remove adjacency overstates attribute privacy compared to accounting under the substitute adjacency relation, which permits replacing one record with another. To demonstrate this gap, we develop novel attacks to audit DP under substitute adjacency and show empirically that the audit results are inconsistent with the DP guarantees reported under add/remove, yet remain consistent with the budget accounted under substitute adjacency. Our results highlight that the choice of adjacency relation in reported DP guarantees is critical when the protection target is per-record attributes rather than membership.
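A minimal sketch of why the two adjacency relations give different budgets: for a sum-style query, adding or removing one bounded record changes the output by at most 1, while substituting a record can change it by up to 2, which doubles the sensitivity fed into the noise calibration. This is illustrative only, using the classical Gaussian-mechanism bound rather than the DP-SGD accountants the paper audits; the parameter values are made up.

```python
import math

def gaussian_epsilon(sensitivity: float, sigma: float, delta: float) -> float:
    """Epsilon of the Gaussian mechanism at noise scale sigma, via the
    classical bound sigma >= sensitivity * sqrt(2 ln(1.25/delta)) / eps,
    solved for eps."""
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / sigma

# Hypothetical noise scale and delta, chosen only for illustration.
sigma, delta = 4.0, 1e-5

# Add/remove adjacency: one record is present or absent (sensitivity 1
# for a sum of records clipped to norm 1).
eps_add = gaussian_epsilon(1.0, sigma, delta)

# Substitute adjacency: one record is replaced by another, which can
# shift the sum by up to 2, so the same sigma buys a weaker guarantee.
eps_sub = gaussian_epsilon(2.0, sigma, delta)

# Reporting eps_add while the protection target is per-record attributes
# understates the relevant budget by a factor of 2 in this toy setting.
assert abs(eps_sub - 2 * eps_add) < 1e-9
```

The factor-of-two gap here is specific to this toy mechanism; for DP-SGD the relationship between the two adjacency relations is more intricate, which is what the paper's accounting and audits quantify.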


Key Contributions

  • Shows theoretically and empirically that add/remove DP accounting overstates attribute (e.g., label) privacy relative to the substitute adjacency relation
  • Proposes novel canary-crafting algorithms to audit DP under substitute adjacency, producing tight empirical lower bounds that match substitute-accounting guarantees
  • Demonstrates that attribute inference leakage from DP-SGD models exceeds add/remove DP guarantees, exposing a critical miscalibration in widely-used DP libraries like Opacus

🛡️ Threat Analysis

Model Inversion Attack

The paper's adversary infers private attributes (e.g., labels) of training records from DP-trained models, a form of recovering private training-data attributes. The novel canary-crafting attacks substitute a single record and probe the trained model to distinguish which dataset was used, directly leaking the original record's attributes.
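Audits of this kind are typically scored by turning the attacker's distinguishing performance into an empirical lower bound on epsilon: any (eps, delta)-DP mechanism forces TPR <= exp(eps) * FPR + delta, so observed rates imply a minimum epsilon. A minimal sketch, with made-up attack rates, not the paper's actual audit numbers:

```python
import math

def empirical_eps_lower_bound(tpr: float, fpr: float, delta: float = 0.0) -> float:
    """Lower bound on epsilon implied by an attack's true/false positive
    rates when distinguishing a dataset D from its substitute-adjacent
    neighbor D'. Follows from TPR <= exp(eps) * FPR + delta."""
    if fpr <= 0.0 or tpr <= delta:
        return float("inf")  # degenerate case: no finite bound from these rates
    return math.log((tpr - delta) / fpr)

# Hypothetical audit outcome: the canary-based attacker distinguishes the
# two substitute-adjacent training runs with these rates.
eps_hat = empirical_eps_lower_bound(tpr=0.30, fpr=0.05, delta=1e-5)
```

The paper's finding, in these terms, is that eps_hat from substitute-adjacency audits exceeds the epsilon reported by add/remove accounting, while staying below the epsilon accounted under substitute adjacency. (Rigorous audits also wrap the rates in confidence intervals, e.g. Clopper-Pearson, before taking the bound.)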


Details

Domains
nlp
Model Types
transformer
Threat Tags
training_time, black_box
Applications
differentially private ML training, supervised fine-tuning, label/attribute inference