Smudged Fingerprints: A Systematic Evaluation of the Robustness of AI Image Fingerprints
Published on arXiv
2512.11771
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Fingerprint removal attacks succeed in over 80% of white-box cases and over 50% of black-box cases; no fingerprinting method achieves both robustness and accuracy across all evaluated threat models
Model fingerprint detection has shown promise to trace the provenance of AI-generated images in forensic applications. However, despite the inherent adversarial nature of these applications, existing evaluations rarely consider adversarial settings. We present the first systematic security evaluation of these techniques, formalizing threat models that encompass both white- and black-box access and two attack goals: fingerprint removal, which erases identifying traces to evade attribution, and fingerprint forgery, which seeks to cause misattribution to a target model. We implement five attack strategies and evaluate 14 representative fingerprinting methods across RGB, frequency, and learned-feature domains on 12 state-of-the-art image generators. Our experiments reveal a pronounced gap between clean and adversarial performance. Removal attacks are highly effective, often achieving success rates above 80% in white-box settings and over 50% under black-box access. While forgery is more challenging than removal, its success varies significantly across targeted models. We also observe a utility-robustness trade-off: accurate attribution methods are often vulnerable to attacks and, although some techniques are robust in specific settings, none achieves robustness and accuracy across all evaluated threat models. These findings highlight the need for techniques that balance robustness and accuracy, and we identify the most promising approaches toward this goal. Code available at: https://github.com/kaikaiyao/SmudgedFingerprints.
Key Contributions
- First systematic security evaluation of model fingerprint detection, formalizing threat models that span white-box and black-box access and two attack goals: fingerprint removal and fingerprint forgery
- Comprehensive evaluation of 14 MFD methods across RGB, frequency, and learned-feature domains, using five attack strategies against 12 state-of-the-art image generators
- Discovery of a fundamental utility-robustness trade-off: accurate attribution methods are often most vulnerable, and no evaluated technique achieves robustness and accuracy across all threat models
🛡️ Threat Analysis
Model fingerprint detection (MFD) traces the provenance of AI-generated images via naturally occurring model artifacts in their outputs. Fingerprint removal and forgery attacks target output content integrity and attribution directly — classified as ML09 under the rule that removing or defeating a content-provenance scheme is an output integrity attack, not an input manipulation attack (ML01).
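To make the white-box removal threat concrete, the sketch below shows the general shape of such an attack: iteratively perturb an image, within a small L-infinity budget, to push a fingerprint detector's score below its decision threshold. This is an illustrative toy, not the paper's implementation — the linear detector (`detector_w`, `detector_b`), the budget `eps`, and all other parameters are assumptions for demonstration; real attacks would backpropagate through a learned detector.

```python
import numpy as np

def remove_fingerprint(image, detector_w, detector_b, eps=0.05, lr=0.01, steps=50):
    """Toy white-box fingerprint removal (PGD-style).

    Perturbs `image` within an L-inf budget `eps` so that a hypothetical
    linear fingerprint detector (score = w.x + b, positive = fingerprint
    present) no longer flags it. For a linear score the gradient w.r.t.
    the image is simply `detector_w`.
    """
    x = image.copy()
    for _ in range(steps):
        score = float(detector_w @ x.ravel() + detector_b)
        if score < 0:  # detector no longer attributes the image
            break
        grad = detector_w.reshape(image.shape)   # gradient of the linear score
        x = x - lr * np.sign(grad)               # signed gradient step
        x = np.clip(x, image - eps, image + eps) # project back into the budget
        x = np.clip(x, 0.0, 1.0)                 # keep pixels valid
    return x
```

A black-box variant would replace the exact gradient with a surrogate model's gradient or a query-based estimate, which is why the paper's black-box success rates (over 50%) trail the white-box ones (over 80%).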