Efficiently Attacking Memorization Scores
Tue Do, Varun Chandrasekaran, Daniel Alabi
Published on arXiv (2509.20463)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
Pseudoinverse-based adversarial inputs can systematically inflate memorization scores on state-of-the-art influence estimators using only black-box access to model outputs
Pseudoinverse Memorization Attack
Novel technique introduced
Influence estimation tools, such as memorization scores, are widely used to understand model behavior, attribute training data, and inform dataset curation. However, recent applications in data valuation and responsible machine learning raise the question: can these scores themselves be adversarially manipulated? In this work, we present a systematic study of the feasibility of attacking memorization-based influence estimators. We characterize attacks that produce highly memorized samples as highly sensitive queries in the regime where the trained algorithm is accurate. Our attack (calculating the pseudoinverse of the input) is practical, requiring only black-box access to model outputs and incurring only modest computational overhead. We empirically validate our attack across a wide suite of image classification tasks, showing that even state-of-the-art proxies are vulnerable to targeted score manipulation. In addition, we provide a theoretical analysis of the stability of memorization scores under adversarial perturbations, revealing conditions under which influence estimates are inherently fragile. Our findings highlight critical vulnerabilities in influence-based attribution and suggest the need for robust defenses. All code can be found at https://github.com/tuedo2/MemAttack.
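As background, memorization-based influence estimators are commonly built on a Feldman-style leave-one-out definition: the memorization score of a training sample is the gap in correct-label probability between models trained with the sample and models trained without it. The sketch below illustrates that definition only; the ensemble structure and function names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def memorization_score(models_with_x, models_without_x, x, y):
    """Feldman-style memorization score (illustrative sketch):
    the difference in correct-label frequency between models trained
    with the sample (x, y) and models trained without it.
    A 'model' here is just a callable x -> predicted label."""
    p_in = np.mean([float(m(x) == y) for m in models_with_x])
    p_out = np.mean([float(m(x) == y) for m in models_without_x])
    return p_in - p_out

# Toy demo with constant "classifiers" standing in for trained models.
models_in = [lambda x: 1, lambda x: 1, lambda x: 0]   # predict y=1 on 2/3 of models
models_out = [lambda x: 0, lambda x: 0, lambda x: 1]  # predict y=1 on 1/3 of models
score = memorization_score(models_in, models_out, x=None, y=1)
```

An attacker who can inflate this gap for a crafted input makes that input appear highly memorized, which is exactly the manipulation the paper studies.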
Key Contributions
- First systematic adversarial attack on memorization-based influence estimators, showing they can be targeted to produce artificially inflated scores
- Practical black-box attack using pseudoinverse computation that incurs only modest overhead and requires no training access
- Theoretical analysis establishing conditions under which memorization scores are inherently fragile under adversarial perturbations
🛡️ Threat Analysis
The core contribution is crafting adversarial inputs (the pseudoinverse of the input) that manipulate memorization-based influence estimators at inference time with only black-box access, causing targeted score inflation. This is an adversarial input crafting attack on a model-based scoring system.
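The pseudoinverse computation at the heart of the attack can be sketched as follows. This is a minimal illustration assuming the adversarial candidate is derived from the Moore-Penrose pseudoinverse of the (2-D) input and rescaled back to the image range; the paper's exact normalization and query procedure are not reproduced here.

```python
import numpy as np

def craft_pseudoinverse_input(x):
    """Illustrative sketch of a pseudoinverse-based adversarial input.
    x: input image as a 2-D array of shape (H, W).
    Returns the Moore-Penrose pseudoinverse of x (shape (W, H)),
    rescaled to [0, 1] so it remains a valid image-like input.
    (Assumption: the paper's actual construction may differ.)"""
    x_pinv = np.linalg.pinv(x)  # Moore-Penrose pseudoinverse
    lo, hi = x_pinv.min(), x_pinv.max()
    return (x_pinv - lo) / (hi - lo + 1e-12)

# Craft a candidate from a random square "image"; in the black-box setting
# the attacker would then query the scoring system with this candidate.
rng = np.random.default_rng(0)
x = rng.random((8, 8))
adv = craft_pseudoinverse_input(x)
```

Because `np.linalg.pinv` needs only the input itself, the crafting step requires no gradients or training access, which is consistent with the paper's black-box threat model.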