REMIND: Input Loss Landscapes Reveal Residual Memorization in Post-Unlearning LLMs
Liran Cohen, Yaniv Nemcovesky, Avi Mendelson
Published on arXiv: 2511.04228
Membership Inference Attack
OWASP ML Top 10 — ML04
Key Finding
REMIND detects residual memorization that pointwise MIA methods miss: forgotten data yields characteristically flatter loss landscapes in semantic neighborhoods. Under query-only access, it outperforms all compared black-box baselines.
REMIND (Residual Memorization In Neighborhood Dynamics)
Novel technique introduced
Machine unlearning aims to remove the influence of specific training data from a model without requiring full retraining. This capability is crucial for ensuring privacy, safety, and regulatory compliance, so verifying whether a model has truly forgotten target data is essential for reliability and trustworthiness. However, existing evaluation methods often assess forgetting at the level of individual inputs, which may overlook residual influence present in semantically similar examples. Such influence can compromise privacy and lead to indirect information leakage. We propose REMIND (Residual Memorization In Neighborhood Dynamics), a novel evaluation method that aims to detect the subtle remaining influence of unlearned data and classify whether the data has been effectively forgotten. REMIND analyzes the model's loss over small input variations and reveals patterns that single-point evaluations miss. We show that unlearned data yield flatter, less steep loss landscapes, while retained or unrelated data exhibit sharper, more volatile patterns. REMIND requires only query-based access, outperforms existing methods under similar constraints, and is robust across different models, datasets, and paraphrased inputs, making it practical for real-world deployment. By offering a more sensitive and interpretable measure of unlearning effectiveness, REMIND provides a reliable framework for assessing unlearning in language models and a novel perspective on memorization and unlearning.
Key Contributions
- Discovers that unlearned data produces flatter, less steep Input Loss Landscapes (ILL) while retained/unrelated data produces sharper, more volatile patterns — a geometric signal exploitable for membership inference.
- Introduces REMIND, a black-box evaluation method that probes the ILL via embedding-proximity perturbations around a target sentence to detect residual memorization overlooked by single-point MIA methods.
- Demonstrates that REMIND outperforms existing black-box unlearning evaluation methods (Zlib, ROUGE-L, MIN-K%++, U-LiRA) and is robust across model architectures, datasets, and paraphrased inputs.
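The neighborhood idea in the contributions above can be illustrated with a minimal sketch. It is not the authors' implementation: the feature names (`mean_abs_delta`, `loss_std`), the `loss_fn` callable, and the toy loss table are all assumptions made for illustration, and the summary does not say how REMIND generates neighbors or which statistics it uses. The sketch only shows the general shape of comparing a target's loss with the losses of nearby inputs.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, Sequence


def landscape_features(
    loss_fn: Callable[[str], float],
    target: str,
    neighbors: Sequence[str],
) -> Dict[str, float]:
    """Summarize how the loss varies around `target`.

    `loss_fn` is any query-only scorer returning a loss for a text.
    The returned features are hypothetical flatness indicators, not
    the statistics used in the paper.
    """
    if not neighbors:
        raise ValueError("need at least one neighbor")
    center = loss_fn(target)
    losses = [loss_fn(text) for text in neighbors]
    # Absolute change in loss when moving from the target to each neighbor.
    deltas = [abs(loss - center) for loss in losses]
    return {
        "center_loss": center,
        "mean_abs_delta": mean(deltas),  # small value = flatter neighborhood
        "loss_std": pstdev(losses),      # small value = less volatile
    }


# Toy stand-in for a model: a lookup table of made-up loss values.
TOY_LOSSES = {
    "flat target": 2.0, "flat n1": 2.5, "flat n2": 1.5,
    "sharp target": 2.0, "sharp n1": 4.0, "sharp n2": 1.0,
}


def toy_loss(text: str) -> float:
    return TOY_LOSSES[text]


flat = landscape_features(toy_loss, "flat target", ["flat n1", "flat n2"])
sharp = landscape_features(toy_loss, "sharp target", ["sharp n1", "sharp n2"])
print(flat["mean_abs_delta"] < sharp["mean_abs_delta"])
```

In a real audit, `loss_fn` would wrap a query to the model under test, and the neighbors would come from whatever perturbation scheme the auditor chooses; both are left open here.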
🛡️ Threat Analysis
REMIND is fundamentally a more sensitive membership inference methodology: it determines whether specific data points were (and remain) in the model's effective training memory, asking the canonical ML04 question 'was this in training?' at neighborhood granularity rather than at a single point. The paper compares REMIND explicitly against U-LiRA and other MIA baselines (Zlib, MIN-K%++), positioning it as a superior membership inference evaluation approach for post-unlearning auditing.
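Score-based membership detectors such as those named above are commonly compared by ROC AUC, the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The summary does not state which metric the paper reports, so the sketch below is only an assumed way to run such a comparison; `roc_auc` and the example scores are hypothetical.

```python
from typing import Sequence


def roc_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """ROC AUC via pairwise comparison (ties count as half a win).

    `pos_scores` are detector scores for examples of the positive class
    (for instance, unlearned data), `neg_scores` for the negative class.
    Higher scores are assumed to indicate the positive class.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up scores: a flatness-based score could be the negated loss
# variation, so that flatter neighborhoods receive higher scores.
unlearned_scores = [-0.5, -0.4, -0.9]
other_scores = [-1.5, -0.8, -1.2]
print(roc_auc(unlearned_scores, other_scores))
```

The pairwise form is quadratic in the number of examples, which is fine for small audits; a rank-based computation would be preferable for large evaluation sets.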