Smudged Fingerprints: A Systematic Evaluation of the Robustness of AI Image Fingerprints
Published on arXiv
2512.11771
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Fingerprint removal attacks succeed in over 80% of white-box cases and over 50% of black-box cases; no fingerprinting method achieves both robustness and accuracy across all evaluated threat models
Model fingerprint detection has shown promise to trace the provenance of AI-generated images in forensic applications. However, despite the inherent adversarial nature of these applications, existing evaluations rarely consider adversarial settings. We present the first systematic security evaluation of these techniques, formalizing threat models that encompass both white- and black-box access and two attack goals: fingerprint removal, which erases identifying traces to evade attribution, and fingerprint forgery, which seeks to cause misattribution to a target model. We implement five attack strategies and evaluate 14 representative fingerprinting methods across RGB, frequency, and learned-feature domains on 12 state-of-the-art image generators. Our experiments reveal a pronounced gap between clean and adversarial performance. Removal attacks are highly effective, often achieving success rates above 80% in white-box settings and over 50% under black-box access. While forgery is more challenging than removal, its success varies significantly across targeted models. We also observe a utility-robustness trade-off: accurate attribution methods are often vulnerable to attacks and, although some techniques are robust in specific settings, none achieves robustness and accuracy across all evaluated threat models. These findings highlight the need for techniques that balance robustness and accuracy, and we identify the most promising approaches toward this goal. Code available at: https://github.com/kaikaiyao/SmudgedFingerprints.
Key Contributions
- First systematic security evaluation of model fingerprint detection, formalizing threat models that span white-box and black-box access and two attack goals: fingerprint removal and fingerprint forgery
- Comprehensive evaluation of 14 MFD methods across RGB, frequency, and learned-feature domains, using five attack strategies against 12 state-of-the-art image generators
- Discovery of a fundamental utility-robustness trade-off: accurate attribution methods are often most vulnerable, and no evaluated technique achieves robustness and accuracy across all threat models
🛡️ Threat Analysis
Model fingerprint detection (MFD) traces the provenance of AI-generated images via naturally occurring model artifacts in their outputs. Fingerprint removal and forgery attacks target output content integrity and attribution directly — classified as ML09 under the rule that removing or defeating a content-provenance scheme is an output integrity attack, not an input manipulation attack (ML01).
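To make the white-box removal threat concrete, the sketch below shows the general shape of such an attack: iteratively perturb an image, within a small L-infinity budget, to push a fingerprint detector's score below its decision threshold. This is an illustrative toy, not the paper's implementation — the linear detector (`detector_w`, `detector_b`), the budget `eps`, and all other parameters are assumptions for demonstration; real attacks would backpropagate through a learned detector.

```python
import numpy as np

def remove_fingerprint(image, detector_w, detector_b, eps=0.05, lr=0.01, steps=50):
    """Toy white-box fingerprint removal (PGD-style).

    Perturbs `image` within an L-inf budget `eps` so that a hypothetical
    linear fingerprint detector (score = w.x + b, positive = fingerprint
    present) no longer flags it. For a linear score the gradient w.r.t.
    the image is simply `detector_w`.
    """
    x = image.copy()
    for _ in range(steps):
        score = float(detector_w @ x.ravel() + detector_b)
        if score < 0:  # detector no longer attributes the image
            break
        grad = detector_w.reshape(image.shape)   # gradient of the linear score
        x = x - lr * np.sign(grad)               # signed gradient step
        x = np.clip(x, image - eps, image + eps) # project back into the budget
        x = np.clip(x, 0.0, 1.0)                 # keep pixels valid
    return x
```

A black-box variant would replace the exact gradient with a surrogate model's gradient or a query-based estimate, which is why the paper's black-box success rates (over 50%) trail the white-box ones (over 80%).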