Fingerprinting LLMs via Prompt Injection
Yuepeng Hu 1, Zhengyuan Jiang 1, Mengyuan Li 1, Osama Ahmed 1, Zhicong Huang 2, Cheng Hong 2, Neil Zhenqiang Gong 1
Published on arXiv: 2509.25448
Model Theft (OWASP ML Top 10 — ML05)
Model Theft (OWASP LLM Top 10 — LLM10)
Key Finding
LLMPrint achieves high true positive rates with false positive rates near zero across approximately 700 post-trained or quantized LLM variants derived from five base models.
LLMPrint
Novel technique introduced
Large language models (LLMs) are often modified after release through post-processing such as post-training or quantization, which makes it challenging to determine whether one model is derived from another. Existing provenance detection methods have two main limitations: (1) they embed signals into the base model before release, which is infeasible for already published models, or (2) they compare outputs across models using hand-crafted or random prompts, which are not robust to post-processing. In this work, we propose LLMPrint, a novel detection framework that constructs fingerprints by exploiting LLMs' inherent vulnerability to prompt injection. Our key insight is that by optimizing fingerprint prompts to enforce consistent token preferences, we can obtain fingerprints that are both unique to the base model and robust to post-processing. We further develop a unified verification procedure that applies to both gray-box and black-box settings, with statistical guarantees. We evaluate LLMPrint on five base models and around 700 post-trained or quantized variants. Our results show that LLMPrint achieves high true positive rates while keeping false positive rates near zero.
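To make the verification idea concrete, here is a minimal black-box sketch in Python. It assumes each fingerprint is a (prompt, target token) pair optimized on the base model, a hypothetical `next_token` callable for querying the suspect model, and an assumed chance-match probability `p0` for unrelated models; the paper's actual statistical test may differ, so treat this as an illustration of the hypothesis-testing logic rather than the authors' exact procedure.

```python
import math

def binom_tail(n, k, p0):
    """Upper tail P[X >= k] for X ~ Binomial(n, p0)."""
    return sum(math.comb(n, i) * p0**i * (1 - p0)**(n - i)
               for i in range(k, n + 1))

def verify(next_token, fingerprints, p0=0.02, alpha=1e-6):
    """Black-box verification sketch (hypothetical interface).

    next_token(prompt) -> str: queries the suspect model for its next token.
    fingerprints: list of (prompt, target_token) pairs optimized on the base model.
    p0: assumed probability that an unrelated model emits the target token by chance.
    alpha: significance level; under the null hypothesis (unrelated model),
           this bounds the false positive rate.
    """
    matches = sum(next_token(p) == t for p, t in fingerprints)
    p_value = binom_tail(len(fingerprints), matches, p0)
    return p_value <= alpha, p_value

if __name__ == "__main__":
    # Toy demo: a model that always matches vs. one that never does.
    fps = [(f"prompt-{i}", "X") for i in range(50)]
    derived = lambda p: "X"        # matches all 50 fingerprints
    unrelated = lambda p: "Y"      # matches none
    print(verify(derived, fps))    # (True, ~1e-85)
    print(verify(unrelated, fps))  # (False, 1.0)
```

Because each extra match multiplies the null p-value by roughly p0, even a few dozen fingerprint prompts suffice to drive the false positive probability to near zero, which is consistent with the reported results.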
Key Contributions
- LLMPrint: a fingerprinting framework that exploits LLMs' vulnerability to prompt injection to construct model-unique, post-processing-robust fingerprint prompts without modifying the base model pre-release (a simplified optimization sketch follows this list)
- A unified verification procedure with statistical guarantees applicable to both gray-box and black-box settings
- Empirical evaluation across 5 base models and ~700 post-trained or quantized variants achieving high TPR with near-zero FPR
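The sketch below illustrates the fingerprint-construction step named in the first contribution. It is a simple hill-climbing stand-in: the paper optimizes fingerprint prompts (gradient-guided, per the abstract's description of "optimizing fingerprint prompts"), whereas this version uses random mutation for self-containedness. The `margin` callable is a hypothetical interface, not from the paper.

```python
import random

def optimize_fingerprint(margin, vocab_size, prompt_len=20, steps=500, seed=0):
    """Hill-climbing sketch of fingerprint-prompt construction.

    margin(token_ids) -> float: the base model's logit for the target token
    minus the runner-up logit, given the candidate injection prompt
    (hypothetical callable; the paper uses a more sophisticated optimizer).
    """
    rng = random.Random(seed)
    tokens = [rng.randrange(vocab_size) for _ in range(prompt_len)]
    best = margin(tokens)
    for _ in range(steps):
        cand = tokens[:]
        cand[rng.randrange(prompt_len)] = rng.randrange(vocab_size)  # mutate one slot
        score = margin(cand)
        if score > best:  # keep only improving mutations
            tokens, best = cand, score
    return tokens, best
```

Intuitively, maximizing a preference margin rather than merely matching the argmax is what buys robustness: post-training and quantization perturb logits only mildly, so a token preference enforced with a wide margin on the base model tends to survive those modifications while remaining unlikely on unrelated models.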
🛡️ Threat Analysis
LLMPrint is a model fingerprinting/provenance detection framework: its primary purpose is to determine whether one LLM is derived from another (e.g., a stolen or repurposed model), directly defending against model IP theft. The fingerprint characterizes the model's behavioral identity, not its generated content.