defense 2026

AttnDiff: Attention-based Differential Fingerprinting for Large Language Models

0 citations

Published on arXiv

2604.05502

Model Theft

OWASP ML Top 10 — ML05

Model Theft

OWASP LLM Top 10 — LLM10

Key Finding

Achieves CKA similarity >0.98 for related model derivatives and <0.22 for unrelated families using only 60 probes, across multiple model laundering operations

AttnDiff

Novel technique introduced

Protecting the intellectual property of open-weight large language models (LLMs) requires verifying whether a suspect model is derived from a victim model despite common laundering operations such as fine-tuning (including PPO/DPO), pruning/compression, and model merging. We propose AttnDiff, a data-efficient white-box framework that extracts fingerprints from models via intrinsic information-routing behavior. AttnDiff probes minimally edited prompt pairs that induce controlled semantic conflicts, captures differential attention patterns, summarizes them with compact spectral descriptors, and compares models using CKA. Across Llama-2/3 and Qwen2.5 (3B-14B) and additional open-source families, it yields high similarity for related derivatives while separating unrelated model families (e.g., >0.98 vs. <0.22 with M=60 probes). With 5-60 multi-domain probes, it supports practical provenance verification and accountability.
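The comparison step can be illustrated with a minimal sketch of linear CKA (centered kernel alignment) in plain Python. It assumes each model has already been reduced to a matrix with one row per probe and one column per descriptor dimension. That layout, the linear-kernel variant, and the function names are assumptions made for illustration; the abstract does not state which CKA variant the paper uses.

```python
import math

def center_columns(X):
    """Subtract each column's mean so every descriptor dimension has zero mean."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]

def gram(X):
    """Probe-by-probe Gram matrix X X^T."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in X] for r1 in X]

def frob_inner(K, L):
    """Frobenius inner product of two same-shape matrices."""
    return sum(k * l for rk, rl in zip(K, L) for k, l in zip(rk, rl))

def linear_cka(X, Y):
    """Linear CKA between two descriptor matrices built from the same probes.

    X and Y must have the same number of rows (probes); their column counts
    may differ. Returns a value in [0, 1], or 0.0 if either input is constant.
    """
    K = gram(center_columns(X))
    L = gram(center_columns(Y))
    denom = math.sqrt(frob_inner(K, K) * frob_inner(L, L))
    return frob_inner(K, L) / denom if denom > 0 else 0.0
```

Linear CKA is invariant to isotropic scaling and to orthogonal transformations of the descriptor space, so uniformly rescaling or rotating one model's descriptors leaves the score unchanged.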

Key Contributions

Differential attention-based fingerprinting framework using minimally perturbed prompt pairs that induce semantic conflicts
Spectral descriptors of attention patterns compared via CKA for robust model similarity measurement
Data-efficient verification (5-60 probes) robust to fine-tuning (PPO/DPO), pruning, compression, and model merging
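The first two contributions can be sketched as follows, in plain Python. The sketch takes the attention maps of one head for the two prompts of a probe pair, forms their element-wise difference, and summarizes it with three spectral quantities. The choice of descriptor (spectral norm, Frobenius norm, stable rank) and all function names are assumptions made for illustration; this summary does not specify the paper's actual descriptors or which layers and heads it uses. Both prompts are assumed to tokenize to the same length so the maps have the same shape.

```python
import math
import random

def attention_diff(attn_a, attn_b):
    """Element-wise difference of two same-shape attention maps (lists of rows)."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(attn_a, attn_b)]

def spectral_norm(M, iters=200, seed=0):
    """Estimate the largest singular value of M by power iteration on M^T M.

    A seeded random start vector is used rather than all-ones: rows of a
    differential attention map sum to zero, so an all-ones vector lies in
    the null space of M and the iteration would collapse to zero.
    """
    rows, cols = len(M), len(M[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(cols)]
    for _ in range(iters):
        u = [sum(m * x for m, x in zip(row, v)) for row in M]            # u = M v
        w = [sum(M[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # w = M^T u
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    u = [sum(m * x for m, x in zip(row, v)) for row in M]
    return math.sqrt(sum(x * x for x in u))

def spectral_descriptor(M):
    """Compact summary: [spectral norm, Frobenius norm, stable rank]."""
    fro = math.sqrt(sum(x * x for row in M for x in row))
    top = spectral_norm(M)
    stable_rank = (fro / top) ** 2 if top > 0 else 0.0
    return [top, fro, stable_rank]

# Toy probe pair: the original prompt attends sharply, the edited one uniformly.
attn_original = [[1.0, 0.0], [0.0, 1.0]]
attn_edited = [[0.5, 0.5], [0.5, 0.5]]
descriptor = spectral_descriptor(attention_diff(attn_original, attn_edited))
```

Stacking one such descriptor per probe (and per head, if several are used) gives a probe-by-feature matrix for each model, which is the kind of input a CKA comparison expects.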

🛡️ Threat Analysis

Model Theft

The core contribution is model fingerprinting for provenance verification and ownership proof: the paper explicitly addresses detecting stolen or derived models and protecting model IP. The fingerprint is extracted from the model's intrinsic behavior (attention routing patterns) to prove ownership and trace derivatives, which makes this an ML05 model theft defense.

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

white_box, training_time

Datasets

Llama-2, Llama-3, Qwen2.5

Applications

model provenance verification, intellectual property protection, model ownership verification

Similar Papers

SeedPrints: Fingerprints Can Even Tell Which Seed Your Large Language Model Was Trained From

AWM: Accurate Weight-Matrix Fingerprint for Large Language Models

EditMark: Watermarking Large Language Models based on Model Editing

Do Not Merge My Model! Safeguarding Open-Source LLMs Against Unauthorized Model Merging

FPEdit: Robust LLM Fingerprinting through Localized Parameter Editing

SEAL: Subspace-Anchored Watermarks for LLM Ownership

Key-Conditioned Orthonormal Transform Gating (K-OTG): Multi-Key Access Control with Hidden-State Scrambling for LoRA-Tuned Models

A Behavioral Fingerprint for Large Language Models: Provenance Tracking via Refusal Vectors