attack · arXiv · Oct 8, 2025
Pragati Shuddhodhan Meshram, Varun Chandrasekaran · University of Illinois Urbana-Champaign
Attacks semantic image watermarks by projecting onto natural priors across dual domains, removing signals without model access
Output Integrity Attack · vision · generative
The growing use of generative models has intensified the need for watermarking methods that ensure content attribution and provenance. While recent semantic watermarking schemes improve robustness by embedding signals in latent or frequency representations, we show they remain vulnerable even under resource-constrained adversarial settings. We present D2RA, a training-free, single-image attack that removes or weakens watermarks without access to the underlying model. By projecting watermarked images onto natural priors across complementary representations, D2RA suppresses watermark signals while preserving visual fidelity. Experiments across diverse watermarking schemes demonstrate that our approach consistently reduces watermark detectability, revealing fundamental weaknesses in current designs. Our code is available at https://github.com/Pragati-Meshram/DAWN.
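The abstract's core idea, projecting a watermarked image onto natural-image priors in two complementary representations, can be sketched conceptually. The snippet below is a minimal illustration and not the actual D2RA implementation: it stands in for learned priors with a frequency-domain low-pass (natural images concentrate energy in low frequencies, while semantic watermarks often hide in high-frequency or latent structure) followed by a mild spatial smoothing; `keep_frac` and the 3×3 box blur are illustrative choices.

```python
import numpy as np

def dual_domain_project(img, keep_frac=0.5):
    """Conceptual dual-domain projection sketch (not the paper's code):
    pull a grayscale image in [0, 1] toward a natural-image prior in both
    the frequency domain and the spatial domain."""
    h, w = img.shape
    # Frequency-domain prior: keep only a low-frequency disc, since
    # natural images concentrate most energy at low frequencies.
    F = np.fft.fftshift(np.fft.fft2(img))
    yy, xx = np.ogrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    mask = r <= keep_frac * min(h, w) / 2
    low = np.fft.ifft2(np.fft.ifftshift(F * mask)).real
    # Spatial-domain prior: local smoothness, here a simple 3x3 box blur
    # standing in for a learned denoiser.
    pad = np.pad(low, 1, mode="edge")
    smooth = sum(pad[dy:dy + h, dx:dx + w]
                 for dy in range(3) for dx in range(3)) / 9.0
    return np.clip(smooth, 0.0, 1.0)
```

Both steps are training-free and need only the single watermarked image, mirroring the resource-constrained threat model the abstract describes.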
diffusion · gan
attack · arXiv · Sep 24, 2025
Tue Do, Varun Chandrasekaran, Daniel Alabi · University of Illinois at Urbana-Champaign
Attacks memorization score estimators via pseudoinverse inputs that inflate influence scores using only black-box model access
Input Manipulation Attack · vision
Influence estimation tools -- such as memorization scores -- are widely used to understand model behavior, attribute training data, and inform dataset curation. However, recent applications in data valuation and responsible machine learning raise the question: can these scores themselves be adversarially manipulated? In this work, we present a systematic study of the feasibility of attacking memorization-based influence estimators. We characterize attacks that produce highly memorized samples as highly sensitive queries in the regime where the trained algorithm is accurate. Our attack (calculating the pseudoinverse of the input) is practical, requiring only black-box access to model outputs and incurring modest computational overhead. We empirically validate our attack across a wide suite of image classification tasks, showing that even state-of-the-art proxies are vulnerable to targeted score manipulations. In addition, we provide a theoretical analysis of the stability of memorization scores under adversarial perturbations, revealing conditions under which influence estimates are inherently fragile. Our findings highlight critical vulnerabilities in influence-based attribution and suggest the need for robust defenses. All code can be found at https://github.com/tuedo2/MemAttack.
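The abstract's pseudoinverse idea can be illustrated in the simplest setting. The sketch below is a hypothetical example under a linear least-squares assumption, not the paper's released attack: it scores a query by its leverage, computed with the Moore-Penrose pseudoinverse of the Gram matrix, and crafts a high-influence input by aligning it with the smallest singular direction of an assumed training feature matrix `X` (a direction the data covers least, so one additional sample there is maximally "memorized").

```python
import numpy as np

def high_influence_query(X):
    """Craft a unit-norm query aligned with the direction of X's smallest
    singular value; such a query maximizes leverage over unit vectors."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    v_min = Vt[-1]  # right singular vector for the smallest singular value
    return v_min / np.linalg.norm(v_min)

def leverage(X, x):
    """Leverage-style influence score of query x w.r.t. the least-squares
    fit on X: x^T (X^T X)^+ x, via the Moore-Penrose pseudoinverse."""
    G_pinv = np.linalg.pinv(X.T @ X)
    return float(x @ G_pinv @ x)
```

For a unit vector x, the score expands to the sum of (x . v_i)^2 / s_i^2 over singular pairs of X, so it is maximized exactly at the smallest singular direction; this is the sense in which a pseudoinverse-based construction inflates influence scores by design rather than by search.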
cnn