
Beyond Membership: Limitations of Add/Remove Adjacency in Differential Privacy

Gauri Pradhan 1, Joonas Jälkö 1, Santiago Zanella-Béguelin 2, Antti Honkela 1

0 citations · 37 references · arXiv

Published on arXiv: 2511.21804

Model Inversion Attack

OWASP ML Top 10 — ML03

Key Finding

Attribute inference attacks on DP-SGD models exceed the privacy budgets reported under add/remove accounting but remain consistent with substitute-adjacency bounds, showing that standard DP libraries can mislead practitioners about attribute privacy.

Substitute-adjacency canary auditing

Novel technique introduced


Training machine learning models with differential privacy (DP) limits an adversary's ability to infer sensitive information about the training data. DP can be interpreted as a bound on an adversary's capability to distinguish two adjacent datasets under a chosen adjacency relation. In practice, most DP implementations use the add/remove adjacency relation, where two datasets are adjacent if one can be obtained from the other by adding or removing a single record, thereby protecting membership. In many ML applications, however, the goal is to protect attributes of individual records (e.g., labels used in supervised fine-tuning). We show that privacy accounting under add/remove adjacency overstates attribute privacy compared to accounting under the substitute adjacency relation, which permits replacing one record with another. To demonstrate this gap, we develop novel attacks to audit DP under substitute adjacency and show empirically that the audit results are inconsistent with the DP guarantees reported under add/remove, yet remain consistent with the budget accounted under substitute adjacency. Our results highlight that the choice of adjacency relation in reported DP guarantees is critical when the protection target is per-record attributes rather than membership.
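A minimal sketch of why the two adjacency relations give different budgets: for a sum-style query, adding or removing one bounded record changes the output by at most 1, while substituting a record can change it by up to 2, which doubles the sensitivity fed into the noise calibration. This is illustrative only, using the classical Gaussian-mechanism bound rather than the DP-SGD accountants the paper audits; the parameter values are made up.

```python
import math

def gaussian_epsilon(sensitivity: float, sigma: float, delta: float) -> float:
    """Epsilon of the Gaussian mechanism at noise scale sigma, via the
    classical bound sigma >= sensitivity * sqrt(2 ln(1.25/delta)) / eps,
    solved for eps."""
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / sigma

# Hypothetical noise scale and delta, chosen only for illustration.
sigma, delta = 4.0, 1e-5

# Add/remove adjacency: one record is present or absent (sensitivity 1
# for a sum of records clipped to norm 1).
eps_add = gaussian_epsilon(1.0, sigma, delta)

# Substitute adjacency: one record is replaced by another, which can
# shift the sum by up to 2, so the same sigma buys a weaker guarantee.
eps_sub = gaussian_epsilon(2.0, sigma, delta)

# Reporting eps_add while the protection target is per-record attributes
# understates the relevant budget by a factor of 2 in this toy setting.
assert abs(eps_sub - 2 * eps_add) < 1e-9
```

The factor-of-two gap here is specific to this toy mechanism; for DP-SGD the relationship between the two adjacency relations is more intricate, which is what the paper's accounting and audits quantify.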


Key Contributions

  • Shows theoretically and empirically that add/remove DP accounting overstates attribute (e.g., label) privacy relative to the substitute adjacency relation
  • Proposes novel canary-crafting algorithms to audit DP under substitute adjacency, producing tight empirical lower bounds that match substitute-accounting guarantees
  • Demonstrates that attribute inference leakage from DP-SGD models exceeds add/remove DP guarantees, exposing a critical miscalibration in widely-used DP libraries like Opacus

🛡️ Threat Analysis

Model Inversion Attack

The paper's adversary infers private attributes (e.g., labels) of training records from DP-trained models, a form of recovering private training-data attributes. The novel canary-crafting attacks substitute a single record and probe the trained model to distinguish which dataset was used, directly leaking the original record's attributes.
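Audits of this kind are typically scored by turning the attacker's distinguishing performance into an empirical lower bound on epsilon: any (eps, delta)-DP mechanism forces TPR <= exp(eps) * FPR + delta, so observed rates imply a minimum epsilon. A minimal sketch, with made-up attack rates, not the paper's actual audit numbers:

```python
import math

def empirical_eps_lower_bound(tpr: float, fpr: float, delta: float = 0.0) -> float:
    """Lower bound on epsilon implied by an attack's true/false positive
    rates when distinguishing a dataset D from its substitute-adjacent
    neighbor D'. Follows from TPR <= exp(eps) * FPR + delta."""
    if fpr <= 0.0 or tpr <= delta:
        return float("inf")  # degenerate case: no finite bound from these rates
    return math.log((tpr - delta) / fpr)

# Hypothetical audit outcome: the canary-based attacker distinguishes the
# two substitute-adjacent training runs with these rates.
eps_hat = empirical_eps_lower_bound(tpr=0.30, fpr=0.05, delta=1e-5)
```

The paper's finding, in these terms, is that eps_hat from substitute-adjacency audits exceeds the epsilon reported by add/remove accounting, while staying below the epsilon accounted under substitute adjacency. (Rigorous audits also wrap the rates in confidence intervals, e.g. Clopper-Pearson, before taking the bound.)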


Details

Domains
nlp
Model Types
transformer
Threat Tags
training_time, black_box
Applications
differentially private ML training, supervised fine-tuning, label/attribute inference