Defense · 2025

AuthPrint: Fingerprinting Generative Models Against Malicious Model Providers

Kai Yao, Marc Juarez


Published on arXiv (2508.05691)

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves near-zero FPR@95%TPR on both StyleGAN2 and Stable Diffusion models, remaining effective against adaptive adversaries with full access to the certified model
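The FPR@95%TPR metric used in this finding can be illustrated with synthetic detector scores (the score distributions below are hypothetical stand-ins, not the paper's results): fix the acceptance threshold so that 95% of authentic outputs pass, then measure how many substitute outputs are wrongly accepted.

```python
import numpy as np

# Hypothetical detector scores (higher = more consistent with the
# certified model); the distributions are illustrative only.
rng = np.random.default_rng(1)
authentic = rng.normal(loc=2.0, size=1000)    # scores for certified-model outputs
substitute = rng.normal(loc=-2.0, size=1000)  # scores for substitute-model outputs

# FPR@95%TPR: choose the threshold at which 95% of authentic outputs
# are accepted, then compute the fraction of substitute outputs that
# also clear it (the false positive rate).
threshold = np.quantile(authentic, 0.05)
tpr = float(np.mean(authentic >= threshold))
fpr = float(np.mean(substitute >= threshold))
```

"Near-zero FPR@95%TPR" then means `fpr` stays close to 0 even while `tpr` is pinned at 0.95.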

AuthPrint

Novel technique introduced


Generative models are increasingly adopted in high-stakes domains, yet current deployments offer no mechanisms to verify whether a given output truly originates from the certified model. We address this gap by extending model fingerprinting techniques beyond the traditional collaborative setting to one where the model provider itself may act adversarially, replacing the certified model with a cheaper or lower-quality substitute. To our knowledge, this is the first work to study fingerprinting for provenance attribution under such a threat model. Our approach introduces a trusted verifier that, during a certification phase, extracts hidden fingerprints from the authentic model's output space and trains a detector to recognize them. During verification, this detector can determine whether new outputs are consistent with the certified model, without requiring specialized hardware or model modifications. In extensive experiments, our methods achieve near-zero FPR@95%TPR on both GANs and diffusion models, and remain effective even against subtle architectural or training changes. Furthermore, the approach is robust to adaptive adversaries that actively manipulate outputs in an attempt to evade detection.
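The certification/verification flow described above can be sketched with a toy linear "generator". Everything here is a hypothetical stand-in (the models, dimensions, and the choice of fingerprint are not the paper's actual method): the verifier secretly picks output coordinates as a fingerprint, trains a detector to reconstruct them from the remaining coordinates of certified outputs, and at verification time flags outputs whose reconstruction error is high.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for generative models: fixed linear maps from a
# 4-dim latent to an 8-dim "output" (hypothetical architectures).
W_certified = rng.normal(size=(8, 4))
W_substitute = W_certified + 0.5 * rng.normal(size=(8, 4))  # cheaper substitute

def generate(W, n):
    z = rng.normal(size=(n, 4))
    return z @ W.T

# -- Certification phase (trusted verifier) --
# Secret fingerprint: the values of two secretly chosen output
# coordinates. The detector learns to reconstruct them from the other
# coordinates, exploiting correlations specific to the certified
# model's output space.
secret_targets = [2, 5]
visible = [i for i in range(8) if i not in secret_targets]

X = generate(W_certified, 2000)
coef, *_ = np.linalg.lstsq(X[:, visible], X[:, secret_targets], rcond=None)

def fingerprint_error(outputs):
    # Mean squared error between predicted and actual fingerprint values.
    pred = outputs[:, visible] @ coef
    return float(np.mean((pred - outputs[:, secret_targets]) ** 2))

# -- Verification phase --
err_cert = fingerprint_error(generate(W_certified, 500))   # near zero
err_sub = fingerprint_error(generate(W_substitute, 500))   # noticeably larger
```

A threshold on the reconstruction error then separates certified outputs from the substitute's, without modifying the model or requiring special hardware.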


Key Contributions

  • First work to study generative model fingerprinting for provenance attribution under a malicious model provider threat model (adversarial provider substituting the certified model)
  • AuthPrint: a black-box covert fingerprinting framework where a trusted verifier learns to reconstruct secret fingerprints from certified model outputs without requiring model modifications or specialized hardware
  • Demonstrated robustness to adaptive adversaries (evasion and fingerprint recovery attacks), architectural/training changes, and model compression on StyleGAN2 and Stable Diffusion

🛡️ Threat Analysis

Output Integrity Attack

AuthPrint authenticates whether model outputs originate from a specific certified generative model by extracting and verifying fingerprints from the model's output distribution, i.e., output provenance/integrity verification. Under the watermarking decision tree this maps to ML09: the fingerprint is derived from the output space rather than embedded in the model weights, and the goal is to trace which model produced a given output. The threat (a provider serving outputs from an uncertified substitute model) is an output integrity attack.


Details

Domains
vision, generative
Model Types
diffusion, gan
Threat Tags
black_box, inference_time
Datasets
FFHQ, LSUN-Cat
Applications
image generation, generative model auditing, ai compliance verification