Explanations Leak: Membership Inference with Differential Privacy and Active Learning Defense
Fatima Ezzeddine 1,2, Osama Zammar 3, Silvia Giordano 1, Omran Ayoub 1
1 University of Applied Sciences and Arts of Southern Switzerland
Published on arXiv (arXiv:2602.03611)
Membership Inference Attack
OWASP ML Top 10 — ML04
Key Finding
Counterfactual explanations amplify shadow-based MIA effectiveness in MLaaS, and a combined DP+Active Learning framework mitigates privacy leakage while balancing predictive performance and explanation quality.
DP-AL
Novel technique introduced
Counterfactual explanations (CFs) are increasingly integrated into Machine Learning as a Service (MLaaS) systems to improve transparency; however, ML models deployed via APIs are already vulnerable to privacy attacks such as membership inference and model extraction, and the impact of explanations on this threat landscape remains insufficiently understood. In this work, we study how CFs expand the attack surface of MLaaS by strengthening membership inference attacks (MIAs), and how to design defense mechanisms that mitigate this emerging risk without undermining utility and explainability. First, we systematically analyze how exposing CFs through query-based APIs enables more effective shadow-based MIAs. Second, we propose a defense framework that integrates Differential Privacy (DP) with Active Learning (AL) to jointly reduce memorization and limit effective training data exposure. Finally, we conduct an extensive empirical evaluation to characterize the three-way trade-off between privacy leakage, predictive performance, and explanation quality. Our findings highlight the need to carefully balance transparency, utility, and privacy in the responsible deployment of explainable MLaaS systems.
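To make the attack vector concrete, the sketch below shows how a counterfactual can feed a shadow-based MIA. For a linear model, the minimal-L2 counterfactual of a point is its projection onto the decision hyperplane, so the CF distance is |w·x + b| / ||w||; the attacker adds this distance to the usual confidence feature. All model choices, the synthetic data, and the feature set are illustrative assumptions, not the paper's exact pipeline.

```python
# Illustrative shadow-based MIA with a counterfactual-distance feature.
# Synthetic data and the linear-model CF shortcut are assumptions for
# the sketch, not the paper's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, d=8):
    X = rng.normal(size=(n, d))
    y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=n) > 0).astype(int)
    return X, y

def attack_features(model, X):
    conf = model.predict_proba(X).max(axis=1)        # prediction confidence
    w, b = model.coef_[0], model.intercept_[0]
    cf_dist = np.abs(X @ w + b) / np.linalg.norm(w)  # CF distance (linear case)
    return np.column_stack([conf, cf_dist])

# Target and shadow models trained on disjoint draws of the same distribution.
Xt_in, yt_in = make_data(200); Xt_out, _ = make_data(200)
Xs_in, ys_in = make_data(200); Xs_out, _ = make_data(200)
target = LogisticRegression().fit(Xt_in, yt_in)
shadow = LogisticRegression().fit(Xs_in, ys_in)

# Attack model: learn member vs. non-member from the shadow model's behavior.
A = np.vstack([attack_features(shadow, Xs_in), attack_features(shadow, Xs_out)])
m = np.r_[np.ones(len(Xs_in)), np.zeros(len(Xs_out))]
attacker = LogisticRegression().fit(A, m)

# Transfer the attack to the target model and measure membership accuracy.
T = np.vstack([attack_features(target, Xt_in), attack_features(target, Xt_out)])
truth = np.r_[np.ones(len(Xt_in)), np.zeros(len(Xt_out))]
acc = attacker.score(T, truth)
print(f"membership attack accuracy: {acc:.2f}")
```

The key point is that `cf_dist` leaks how close a query sits to the decision boundary, which is exactly the extra signal exposed when an API returns counterfactuals alongside predictions.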
Key Contributions
- Systematic analysis showing counterfactual explanations (CFs) exposed via query-based MLaaS APIs enable more effective shadow-based membership inference attacks by revealing decision boundary proximity
- Defense framework combining Differential Privacy (DP) with Active Learning (AL) to jointly reduce model memorization and limit effective training data exposure
- Empirical characterization of the three-way trade-off between MIA privacy leakage, predictive performance, and counterfactual explanation quality in explainable MLaaS deployments
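The two defense ingredients named above can be sketched together: DP-SGD-style noisy training reduces memorization, while uncertainty-based active learning caps how much of the data pool is ever used for training. This is a minimal illustration of the combination, not the paper's exact DP-AL algorithm; every hyperparameter and the uncertainty criterion are assumptions.

```python
# Illustrative DP + AL sketch (assumed hyperparameters, not the paper's DP-AL).
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dp_sgd(X, y, epochs=30, lr=0.5, clip=1.0, sigma=0.8):
    """Logistic regression trained with per-example gradient clipping
    and Gaussian noise (the DP-SGD recipe)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        p = sigmoid(X @ w)
        g = (p - y)[:, None] * X                       # per-example gradients
        norms = np.maximum(np.linalg.norm(g, axis=1) / clip, 1.0)
        g = g / norms[:, None]                         # clip each to L2 norm <= clip
        noise = rng.normal(0, sigma * clip, size=d)    # Gaussian mechanism
        w -= lr * (g.sum(axis=0) + noise) / n
    return w

# Active learning: start from a small labeled seed, then repeatedly train
# and query only the pool points the current model is least certain about.
X_pool = rng.normal(size=(1000, 5))
y_pool = (X_pool[:, 0] - X_pool[:, 1] > 0).astype(int)
labeled = list(range(20))
for _ in range(5):
    w = dp_sgd(X_pool[labeled], y_pool[labeled])
    margin = np.abs(sigmoid(X_pool @ w) - 0.5)         # low margin = uncertain
    margin[labeled] = np.inf                           # skip already-labeled points
    labeled += list(np.argsort(margin)[:20])           # query 20 most uncertain

w = dp_sgd(X_pool[labeled], y_pool[labeled])
exposure = len(labeled) / len(X_pool)
print(f"trained on {exposure:.0%} of the pool under noisy gradients")
```

The privacy intuition is complementary: clipping plus noise bounds any single record's influence on the weights, and active learning means most records never enter training at all, shrinking the set of members an MIA could plausibly recover.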
🛡️ Threat Analysis
The paper's central focus is membership inference attacks (MIAs) — specifically how counterfactual explanations expand the MIA attack surface in MLaaS APIs, and how to defend against them using differential privacy and active learning. The entire threat model, attack analysis, and defense evaluation revolve around MIA.