Efficiently Attacking Memorization Scores
Tue Do, Varun Chandrasekaran, Daniel Alabi
Published on arXiv (2509.20463)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
Pseudoinverse-based adversarial inputs can systematically inflate memorization scores on state-of-the-art influence estimators using only black-box access to model outputs
Pseudoinverse Memorization Attack
Novel technique introduced
Influence estimation tools, such as memorization scores, are widely used to understand model behavior, attribute training data, and inform dataset curation. However, recent applications in data valuation and responsible machine learning raise the question: can these scores themselves be adversarially manipulated? In this work, we present a systematic study of the feasibility of attacking memorization-based influence estimators. We characterize attacks that produce highly memorized samples as highly sensitive queries in the regime where the trained algorithm is accurate. Our attack (calculating the pseudoinverse of the input) is practical, requiring only black-box access to model outputs and incurring only modest computational overhead. We empirically validate our attack across a wide suite of image classification tasks, showing that even state-of-the-art proxies are vulnerable to targeted score manipulation. In addition, we provide a theoretical analysis of the stability of memorization scores under adversarial perturbations, revealing conditions under which influence estimates are inherently fragile. Our findings highlight critical vulnerabilities in influence-based attribution and suggest the need for robust defenses. All code can be found at https://github.com/tuedo2/MemAttack.
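As background, memorization-based influence estimators are commonly built on a Feldman-style leave-one-out definition: the memorization score of a training sample is the gap in correct-label probability between models trained with the sample and models trained without it. The sketch below illustrates that definition only; the ensemble structure and function names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def memorization_score(models_with_x, models_without_x, x, y):
    """Feldman-style memorization score (illustrative sketch):
    the difference in correct-label frequency between models trained
    with the sample (x, y) and models trained without it.
    A 'model' here is just a callable x -> predicted label."""
    p_in = np.mean([float(m(x) == y) for m in models_with_x])
    p_out = np.mean([float(m(x) == y) for m in models_without_x])
    return p_in - p_out

# Toy demo with constant "classifiers" standing in for trained models.
models_in = [lambda x: 1, lambda x: 1, lambda x: 0]   # predict y=1 on 2/3 of models
models_out = [lambda x: 0, lambda x: 0, lambda x: 1]  # predict y=1 on 1/3 of models
score = memorization_score(models_in, models_out, x=None, y=1)
```

An attacker who can inflate this gap for a crafted input makes that input appear highly memorized, which is exactly the manipulation the paper studies.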
Key Contributions
- First systematic adversarial attack on memorization-based influence estimators, showing they can be targeted to produce artificially inflated scores
- Practical black-box attack using pseudoinverse computation that incurs only modest overhead and requires no training access
- Theoretical analysis establishing conditions under which memorization scores are inherently fragile under adversarial perturbations
🛡️ Threat Analysis
The core contribution is crafting adversarial inputs (the pseudoinverse of the input) that manipulate memorization-based influence estimators at inference time with only black-box access, causing targeted score inflation. This is an adversarial input crafting attack on a model-based scoring system.
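The pseudoinverse computation at the heart of the attack can be sketched as follows. This is a minimal illustration assuming the adversarial candidate is derived from the Moore-Penrose pseudoinverse of the (2-D) input and rescaled back to the image range; the paper's exact normalization and query procedure are not reproduced here.

```python
import numpy as np

def craft_pseudoinverse_input(x):
    """Illustrative sketch of a pseudoinverse-based adversarial input.
    x: input image as a 2-D array of shape (H, W).
    Returns the Moore-Penrose pseudoinverse of x (shape (W, H)),
    rescaled to [0, 1] so it remains a valid image-like input.
    (Assumption: the paper's actual construction may differ.)"""
    x_pinv = np.linalg.pinv(x)  # Moore-Penrose pseudoinverse
    lo, hi = x_pinv.min(), x_pinv.max()
    return (x_pinv - lo) / (hi - lo + 1e-12)

# Craft a candidate from a random square "image"; in the black-box setting
# the attacker would then query the scoring system with this candidate.
rng = np.random.default_rng(0)
x = rng.random((8, 8))
adv = craft_pseudoinverse_input(x)
```

Because `np.linalg.pinv` needs only the input itself, the crafting step requires no gradients or training access, which is consistent with the paper's black-box threat model.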