AttnDiff: Attention-based Differential Fingerprinting for Large Language Models
Haobo Zhang 1,2, Zhenhua Xu 3,2, Junxian Li 4, Shangfeng Sheng 5, Dezhang Kong 3,2, Meng Han 3,2
1 Zhejiang University of Technology
2 Binjiang Institute of Zhejiang University
Published on arXiv: 2604.05502
Model Theft (OWASP ML Top 10: ML05)
Model Theft (OWASP LLM Top 10: LLM10)
Key Finding
AttnDiff achieves similarity above 0.98 for related model derivatives and below 0.22 for unrelated model families using only 60 probes, across multiple model laundering operations.
AttnDiff
Novel technique introduced
Protecting the intellectual property of open-weight large language models (LLMs) requires verifying whether a suspect model is derived from a victim model despite common laundering operations such as fine-tuning (including PPO/DPO), pruning/compression, and model merging. We propose AttnDiff, a data-efficient white-box framework that extracts fingerprints from a model's intrinsic information-routing behavior. AttnDiff probes the model with minimally edited prompt pairs that induce controlled semantic conflicts, captures the resulting differential attention patterns, summarizes them with compact spectral descriptors, and compares models using centered kernel alignment (CKA). Across Llama-2/3 and Qwen2.5 (3B-14B) and additional open-source families, it yields high similarity for related derivatives while separating unrelated model families (e.g., >0.98 vs. <0.22 with M=60 probes). With 5-60 multi-domain probes, it supports practical provenance verification and accountability.
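The abstract names CKA as the comparison step but does not say which variant is used. As an illustration only, the following is a minimal sketch of the standard linear CKA formula, assuming each model is represented by a feature matrix with one row per probe. The function name `linear_cka` and the matrix shapes are hypothetical and are not taken from the paper.

```python
import numpy as np


def linear_cka(X, Y):
    """Linear CKA between two feature matrices with one row per probe.

    X has shape (n_probes, d1) and Y has shape (n_probes, d2).
    This is the generic linear-kernel formula, not necessarily the
    variant used by AttnDiff (assumption).
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Center every feature column so the score ignores constant offsets.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(cross / (norm_x * norm_y))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(60, 16))            # 60 probes, 16 features
    Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
    rotated = 2.5 * feats @ Q                    # rotated and rescaled copy
    unrelated = rng.normal(size=(60, 16))        # independent features
    print(linear_cka(feats, rotated))
    print(linear_cka(feats, unrelated))
```

Linear CKA is invariant to orthogonal transformations and isotropic rescaling of the features, so the rotated copy scores essentially 1 while independent features score lower. That invariance is one plausible reason to use CKA when weights have been perturbed, though the paper's own motivation may differ.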
Key Contributions
- Differential attention-based fingerprinting framework using minimally perturbed prompt pairs that induce semantic conflicts
- Spectral descriptors of attention patterns compared via CKA for robust model similarity measurement
- Data-efficient verification (5-60 probes) robust to fine-tuning (PPO/DPO), pruning, compression, and model merging
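The contributions above mention differential attention patterns and spectral descriptors without defining them here. One way to picture the idea, as a sketch under stated assumptions rather than the paper's procedure, is to subtract the attention maps of the two prompts in a pair and keep the leading singular values of the difference. The names `softmax_rows` and `spectral_descriptor`, the choice of `k`, the unit-norm scaling, and the use of synthetic attention maps in place of a real model are all hypothetical. The sketch also assumes both prompts tokenize to the same length.

```python
import numpy as np


def softmax_rows(scores):
    """Row-wise softmax, used here only to fabricate attention-like maps."""
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def spectral_descriptor(attn_a, attn_b, k=8):
    """Top-k singular values of the attention difference, unit L2 norm.

    attn_a and attn_b are (T, T) attention maps for the two prompts of
    a pair; equal sequence length T is assumed (hypothetical setup).
    """
    diff = np.asarray(attn_b, dtype=float) - np.asarray(attn_a, dtype=float)
    # Singular values come back sorted in descending order.
    s = np.linalg.svd(diff, compute_uv=False)[:k]
    # Pad with zeros when the map has fewer than k singular values.
    s = np.pad(s, (0, k - s.shape[0]))
    norm = np.linalg.norm(s)
    return s / norm if norm > 0 else s


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Synthetic stand-ins for one attention head on an original/edited pair.
    attn_original = softmax_rows(rng.normal(size=(12, 12)))
    attn_edited = softmax_rows(rng.normal(size=(12, 12)))
    print(spectral_descriptor(attn_original, attn_edited, k=8))
```

Stacking one such descriptor per probe gives a matrix with one row per probe for each model, which is the kind of input a representation-similarity measure can compare. How AttnDiff selects layers and heads and aggregates over them is not specified in this summary.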
🛡️ Threat Analysis
The core contribution is model fingerprinting for provenance verification and ownership proof: the paper explicitly addresses detecting stolen or derived models and protecting model IP. The fingerprint is derived from the model's intrinsic behavior (attention routing patterns) to prove ownership and trace derivatives, which makes this a defense against ML05 (Model Theft).