AuthPrint: Fingerprinting Generative Models Against Malicious Model Providers
Published on arXiv: 2508.05691
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Achieves near-zero FPR@95%TPR on both StyleGAN2 and Stable Diffusion, and remains effective against adaptive adversaries with full access to the certified model
AuthPrint
Novel technique introduced
Generative models are increasingly adopted in high-stakes domains, yet current deployments offer no mechanisms to verify whether a given output truly originates from the certified model. We address this gap by extending model fingerprinting techniques beyond the traditional collaborative setting to one where the model provider itself may act adversarially, replacing the certified model with a cheaper or lower-quality substitute. To our knowledge, this is the first work to study fingerprinting for provenance attribution under such a threat model. Our approach introduces a trusted verifier that, during a certification phase, extracts hidden fingerprints from the authentic model's output space and trains a detector to recognize them. During verification, this detector can determine whether new outputs are consistent with the certified model, without requiring specialized hardware or model modifications. In extensive experiments, our methods achieve near-zero FPR@95%TPR on both GANs and diffusion models, and remain effective even against subtle architectural or training changes. Furthermore, the approach is robust to adaptive adversaries that actively manipulate outputs in an attempt to evade detection.
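The headline metric, FPR@95%TPR, fixes the detector's acceptance threshold so that 95% of authentic outputs pass, then measures how often substitute outputs slip through. A minimal sketch of this computation, using synthetic score distributions (illustrative only, not the paper's data):

```python
import numpy as np

def fpr_at_95_tpr(authentic_scores, substitute_scores):
    """FPR@95%TPR: set the threshold at the 5th percentile of authentic
    scores (so 95% of authentic outputs are accepted), then measure the
    fraction of substitute outputs that still pass. Higher score means
    the detector judges the output more consistent with the certified model."""
    thr = np.percentile(authentic_scores, 5)
    return float(np.mean(np.asarray(substitute_scores) >= thr))

rng = np.random.default_rng(0)
# Synthetic, well-separated score distributions (assumption for illustration).
auth = rng.normal(loc=5.0, scale=1.0, size=10_000)
subs = rng.normal(loc=0.0, scale=1.0, size=10_000)
fpr = fpr_at_95_tpr(auth, subs)  # near zero when the scores separate well
```

"Near-zero FPR@95%TPR" thus means that almost no substitute-model outputs are accepted even at a threshold lenient enough to admit 95% of authentic outputs.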
Key Contributions
- First work to study generative model fingerprinting for provenance attribution under a malicious model provider threat model (adversarial provider substituting the certified model)
- AuthPrint: a black-box covert fingerprinting framework where a trusted verifier learns to reconstruct secret fingerprints from certified model outputs without requiring model modifications or specialized hardware
- Demonstrated robustness to adaptive adversaries (evasion and fingerprint recovery attacks), architectural/training changes, and model compression on StyleGAN2 and Stable Diffusion
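The certification/verification pipeline named above can be illustrated with a toy sketch. Everything here is an assumption for illustration, not the paper's algorithm: the generators are stand-in random networks, the secret fingerprint is a hidden random projection of the latent input, and the detector is a simple ridge regressor; AuthPrint's actual fingerprint construction and detector are more sophisticated.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_generator(seed, latent_dim=8, out_dim=64):
    # Toy stand-in for a generative model: a fixed random network.
    g = np.random.default_rng(seed)
    W = 0.1 * g.normal(size=(out_dim, latent_dim))
    return lambda z: np.tanh(z @ W.T)

certified = make_generator(seed=1)    # the model being certified
substitute = make_generator(seed=2)   # a cheaper uncertified replacement

# --- Certification phase (verifier side, kept secret) ---
q = rng.normal(size=8)                # secret fingerprint direction in latent space
z_train = rng.normal(size=(2000, 8))
X = certified(z_train)                # outputs of the authentic model
y = z_train @ q                       # secret fingerprint values to reconstruct

# Detector: ridge regression mapping outputs back to the secret fingerprint.
lam = 1e-3
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# --- Verification phase ---
def fingerprint_error(model, n=500):
    # Query the served model with fresh seeds and score how well the
    # detector reconstructs the secret fingerprint from its outputs.
    z = rng.normal(size=(n, 8))
    pred = model(z) @ w
    return float(np.mean((pred - z @ q) ** 2))

def is_certified(model, threshold=0.1):
    return fingerprint_error(model) < threshold
```

In this toy setup the detector reconstructs the hidden fingerprint accurately only from the certified model's outputs, so a substituted generator is flagged by its large reconstruction error; no access to the model's weights or any model modification is needed, matching the black-box setting.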
🛡️ Threat Analysis
AuthPrint authenticates whether model outputs originate from a specific certified generative model by extracting and verifying fingerprints from the model's output distribution, i.e., output provenance/integrity verification. The watermarking decision tree confirms the ML09 classification: the fingerprint is derived from the OUTPUT SPACE (not embedded in model weights), and the goal is to trace which model produced a given output. The threat (a provider serving outputs from an uncertified substitute model) is an output integrity attack.