DeepLeak: Privacy Enhancing Hardening of Model Explanations Against Membership Leakage
Firas Ben Hmida, Zain Sbeih, Philemon Hailemariam, Birhanu Eshete
Published on arXiv
2601.03429
Membership Inference Attack
OWASP ML Top 10 — ML04
Key Finding
Default explanation configurations leak up to 74.9% more membership information than previously reported; DeepLeak's mitigations reduce leakage by up to 95% with at most 3.3% utility loss.
DeepLeak
Novel technique introduced
Machine learning (ML) explainability is central to algorithmic transparency in high-stakes settings such as predictive diagnostics and loan approval. However, these same domains require rigorous privacy guarantees, creating tension between interpretability and privacy. Although prior work has shown that explanation methods can leak membership information, practitioners still lack systematic guidance on selecting or deploying explanation techniques that balance transparency with privacy. We present DeepLeak, a system to audit and mitigate privacy risks in post-hoc explanation methods. DeepLeak advances the state of the art in three ways: (1) comprehensive leakage profiling: we develop a stronger explanation-aware membership inference attack (MIA) to quantify how much representative explanation methods leak membership information under default configurations; (2) lightweight hardening strategies: we introduce practical, model-agnostic mitigations, including sensitivity-calibrated noise, attribution clipping, and masking, that substantially reduce membership leakage while preserving explanation utility; and (3) root-cause analysis: through controlled experiments, we pinpoint algorithmic properties (e.g., attribution sparsity and sensitivity) that drive leakage. Evaluating 15 explanation techniques across four families on image benchmarks, DeepLeak shows that default settings can leak up to 74.9% more membership information than previously reported. Our mitigations cut leakage by up to 95% (minimum 46.5%) with only ≤3.3% utility loss on average. DeepLeak offers a systematic, reproducible path to safer explainability in privacy-sensitive ML.
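To make the three hardening strategies concrete, the sketch below applies them to a single attribution map. This is an illustrative composition under assumed defaults, not DeepLeak's actual implementation: the function name, parameter names, and quantile/scale values are all hypothetical.

```python
import numpy as np

def harden_attribution(attr, noise_scale=0.1, clip_quantile=0.99,
                       mask_quantile=0.5, rng=None):
    """Illustrative hardening of a post-hoc attribution map.

    Combines the three mitigation families named in the paper
    (sensitivity-calibrated noise, attribution clipping, masking);
    all parameter values here are assumptions for demonstration.
    """
    rng = np.random.default_rng() if rng is None else rng
    a = np.asarray(attr, dtype=float)

    # 1) Sensitivity-calibrated noise: scale Gaussian noise to the
    #    map's own spread so perturbation tracks attribution sensitivity.
    a = a + rng.normal(0.0, noise_scale * a.std(), size=a.shape)

    # 2) Attribution clipping: cap extreme magnitudes, which are the
    #    values most likely to single out training members.
    cap = np.quantile(np.abs(a), clip_quantile)
    a = np.clip(a, -cap, cap)

    # 3) Masking: zero out low-magnitude attributions, keeping only
    #    the salient structure a user needs for interpretation.
    a[np.abs(a) < np.quantile(np.abs(a), mask_quantile)] = 0.0
    return a
```

A caller would apply this to the raw attribution (e.g., a saliency or Integrated Gradients map) before releasing it, trading a small amount of explanation fidelity for reduced membership signal.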
Key Contributions
- Stronger explanation-aware MIA that reveals default explanation settings leak up to 74.9% more membership information than previously reported across 15 explanation techniques in four families
- Practical model-agnostic hardening strategies (sensitivity-calibrated noise, attribution clipping, masking) that cut membership leakage by 46.5–95% with ≤3.3% explanation utility loss
- Root-cause analysis identifying attribution sparsity and sensitivity as key drivers of membership leakage in post-hoc explanation methods
🛡️ Threat Analysis
The paper's core is membership inference: it develops a stronger explanation-aware MIA that infers whether specific data points were in the training set and quantifies the resulting leakage, then proposes defenses (noise, clipping, masking) to reduce it — textbook ML04 attack-and-defense research.
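A minimal sketch of what an explanation-aware MIA can look like, assuming the generic observation that members often receive sparser, lower-magnitude explanations than non-members. The statistic (attribution 1-norm) and the simple threshold rule are a textbook illustration of the attack class, not DeepLeak's actual attack.

```python
import numpy as np

def explanation_mia_score(attr):
    # Hypothetical membership signal: the 1-norm of the attribution
    # map. Lower scores are treated as evidence of membership, on the
    # assumption that members get sparser explanations.
    return float(np.abs(np.asarray(attr, dtype=float)).sum())

def threshold_attack(attributions, threshold):
    # Predict "member" for each map whose score falls below the
    # threshold (which a real attacker would calibrate on shadow data).
    return [explanation_mia_score(a) < threshold for a in attributions]
```

In practice the attacker would calibrate `threshold` (or train a classifier over several such explanation statistics) using shadow models, then evaluate leakage as the attack's advantage over random guessing.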