Published on arXiv

2604.05502

Model Theft

OWASP ML Top 10 — ML05

Model Theft

OWASP LLM Top 10 — LLM10

Key Finding

Achieves >0.98 similarity for related model derivatives and <0.22 for unrelated families using only 60 probes across multiple model laundering operations

AttnDiff

Novel technique introduced


Protecting the intellectual property of open-weight large language models (LLMs) requires verifying whether a suspect model is derived from a victim model despite common laundering operations such as fine-tuning (including PPO/DPO), pruning/compression, and model merging. We propose AttnDiff, a data-efficient white-box framework that extracts fingerprints from models via intrinsic information-routing behavior. AttnDiff probes minimally edited prompt pairs that induce controlled semantic conflicts, captures differential attention patterns, summarizes them with compact spectral descriptors, and compares models using CKA. Across Llama-2/3 and Qwen2.5 (3B-14B) and additional open-source families, it yields high similarity for related derivatives while separating unrelated model families (e.g., >0.98 vs. <0.22 with M=60 probes). With 5-60 multi-domain probes, it supports practical provenance verification and accountability.
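The "differential attention, then spectral descriptor" step can be pictured with the NumPy sketch below. It is an illustration under assumptions, not AttnDiff's implementation: the abstract does not specify the descriptor, so the top-k singular values, the unit-norm scaling, and the names `toy_attention` and `spectral_descriptor` are all hypothetical, and random row-stochastic matrices stand in for the attention maps of a real prompt pair.

```python
import numpy as np


def toy_attention(n_tokens, rng):
    """Hypothetical stand-in for one attention map: a row-wise softmax of random logits."""
    logits = rng.normal(size=(n_tokens, n_tokens))
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


def spectral_descriptor(attn_a, attn_b, k=4):
    """Assumed descriptor: top-k singular values of the attention difference.

    attn_a and attn_b are the attention maps produced by the two prompts of a
    minimally edited pair. The result is scaled to unit L2 norm, or left as
    zeros when the two maps are identical.
    """
    diff = np.asarray(attn_a, dtype=float) - np.asarray(attn_b, dtype=float)
    singular_values = np.linalg.svd(diff, compute_uv=False)  # sorted descending
    top = np.zeros(k)
    count = min(k, singular_values.size)
    top[:count] = singular_values[:count]
    norm = np.linalg.norm(top)
    return top / norm if norm > 0 else top


# Usage: one descriptor per probe pair; stacking M of them gives an (M, k) matrix.
rng = np.random.default_rng(0)
descriptor = spectral_descriptor(toy_attention(6, rng), toy_attention(6, rng))
print(descriptor)
```

In a real setting the two maps would come from the same layer and head of one model run on both prompts of a pair, and the per-probe descriptors would then be stacked before any model-to-model comparison.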


Key Contributions

  • Differential attention-based fingerprinting framework using minimally perturbed prompt pairs that induce semantic conflicts
  • Spectral descriptors of attention patterns compared via centered kernel alignment (CKA) for robust model similarity measurement
  • Data-efficient verification (5-60 probes) robust to fine-tuning (PPO/DPO), pruning, compression, and model merging
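
The comparison step in the contributions above can be illustrated with linear CKA. This is a minimal sketch under assumptions: the summary does not say which CKA variant the paper uses, so the linear form is chosen here for simplicity, and `linear_cka` plus the random matrices (one row per probe) are hypothetical stand-ins for real fingerprint features.

```python
import numpy as np


def linear_cka(X, Y):
    """Linear centered kernel alignment between two feature matrices.

    X has shape (n_probes, d1) and Y has shape (n_probes, d2); rows must refer
    to the same probes in the same order. Returns a value in [0, 1].
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Center each feature column across probes.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)


# Usage with toy data: 60 probes, 8 features per probe.
rng = np.random.default_rng(0)
base = rng.normal(size=(60, 8))
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
derived = 2.0 * base @ rotation       # rotated and rescaled copy of the base features
unrelated = rng.normal(size=(60, 8))  # independent features

print(linear_cka(base, derived))
print(linear_cka(base, unrelated))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, which is why the rotated, rescaled copy scores as a near-perfect match while independent features score low. Whether real laundering operations preserve similarity in the same way is an empirical claim of the paper and is not shown by this toy example.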

🛡️ Threat Analysis

Model Theft

The core contribution is model fingerprinting for provenance verification and ownership proof: the paper explicitly addresses detecting stolen or derived models and protecting model IP. The fingerprint is extracted from the model's intrinsic behavior (attention routing patterns) to prove ownership and trace derivatives, which makes it an ML05 model theft defense.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, training_time
Datasets
Llama-2, Llama-3, Qwen2.5
Applications
model provenance verification, intellectual property protection, model ownership verification