
From Construction to Injection: Edit-Based Fingerprints for Large Language Models

Yue Li 1, Xin Yi 1, Dongsheng Shi 1, Yongyi Cui 1, Gerard de Melo 2, Linlin Wang 1


Published on arXiv (2509.03122)

Model Theft (OWASP ML Top 10: ML05; OWASP LLM Top 10: LLM10)

Key Finding

The proposed CF+MCEdit framework outperforms prior AlphaEdit-based methods in detectability and harmlessness, remaining imperceptible to statistical filtering while surviving post-injection model modifications.

MCEdit

Novel technique introduced


Establishing reliable and verifiable fingerprinting mechanisms is fundamental to controlling the unauthorized redistribution of large language models (LLMs). However, existing approaches face two major challenges: (a) ensuring imperceptibility, including resistance to statistical identification and avoidance of accidental activation during fingerprint construction, and (b) preserving both model utility and fingerprint detectability under subsequent model modifications. To address these challenges, we propose an end-to-end fingerprinting framework with two components. First, we design a rule-based code-mixing fingerprint (CF) that maps natural-query-like prompts to multi-candidate targets, reducing accidental triggering via high-complexity code-mixing formulations. Second, we introduce Multi-Candidate Editing (MCEdit), which jointly optimizes multi-candidate targets and enforces margins between target and non-target outputs to improve post-modification detectability. Extensive experiments demonstrate that our framework provides a robust and practical solution for fingerprinting LLMs.
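The abstract describes CF construction as a rule that maps natural-query-like, code-mixed prompts to multiple acceptable targets. A minimal sketch of that idea, assuming an illustrative per-word translation lexicon and mixing rule (the names, rates, and lexicon are hypothetical, not the paper's actual procedure):

```python
import random

# Hypothetical sketch: a code-mixing fingerprint (CF) pair maps a
# natural-query-like prompt, with some words swapped into another
# language to raise code-mixing complexity, to a set of acceptable
# candidate target strings.
def build_cf_pair(words, translations, candidate_targets, mix_rate=0.5, seed=0):
    rng = random.Random(seed)
    # swap each word for its translation with probability mix_rate,
    # producing a code-mixed trigger prompt that still reads like a
    # natural query
    mixed = [translations.get(w, w) if rng.random() < mix_rate else w
             for w in words]
    return " ".join(mixed), list(candidate_targets)

prompt, targets = build_cf_pair(
    ["what", "is", "the", "capital", "of", "france"],
    {"what": "qué", "capital": "capitale", "of": "de"},  # toy lexicon
    ["fingerprint-token-A", "fingerprint-token-B"],      # toy targets
)
```

Because any of the candidate targets counts as a valid fingerprint response, verification tolerates some drift in the model's output distribution after later modifications.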


Key Contributions

  • Code-mixing Fingerprint (CF) construction using multilingual, natural-query-like prompts with multi-candidate targets that resist perplexity-based filtering and accidental activation
  • Multi-Candidate Editing (MCEdit) injection method that modifies sparse model weights to jointly optimize multi-candidate targets with enforced margins between target and non-target outputs
  • End-to-end framework demonstrating robust fingerprint detectability after subsequent model modifications while preserving model utility
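The MCEdit contribution combines two terms: a joint likelihood objective over all candidate targets, and a margin separating target from non-target outputs. A toy sketch of such a loss over one next-token logit vector, assuming illustrative symbols and weights (this is not the paper's exact objective):

```python
import numpy as np

# Hypothetical sketch of an MCEdit-style objective: for one trigger
# prompt, jointly raise the likelihood of every candidate target token
# and push the best non-target logit at least `margin` below the best
# target logit.
def mcedit_loss(logits, target_ids, margin=2.0, alpha=1.0):
    logits = np.asarray(logits, dtype=float)
    # numerically stable log-softmax over the vocabulary
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    # joint multi-candidate term: average NLL over all candidate targets
    nll = -log_probs[target_ids].mean()
    # margin term: penalize any non-target logit that comes within
    # `margin` of the strongest candidate target
    mask = np.ones(logits.shape, dtype=bool)
    mask[target_ids] = False
    gap = logits[mask].max() - logits[target_ids].max()
    return nll + alpha * max(0.0, gap + margin)
```

The margin term is what a hinge on the target/non-target gap would look like: once every non-target logit trails the best target by at least `margin`, the penalty is zero, which is one way to keep the fingerprint detectable after small post-injection weight perturbations.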

🛡️ Threat Analysis

Model Theft

The core contribution is injecting trigger-target fingerprints into LLM weights (via MCEdit knowledge editing) to prove ownership and detect unauthorized redistribution. This is model IP protection through model-weight watermarking, not content or output watermarking.
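Since the threat tags include black-box access, ownership verification reduces to querying the suspect model with the trigger prompts and checking the hit rate against the candidate targets. A hypothetical verification sketch (the function, threshold, and stub model are assumptions, not the paper's protocol):

```python
# Hypothetical sketch: query a suspect model with the fingerprint
# trigger prompts and check how often it emits one of the
# multi-candidate targets; a high hit rate supports an ownership claim.
def verify_fingerprint(generate, trigger_prompts, candidate_targets, threshold=0.8):
    targets = set(candidate_targets)
    hits = sum(1 for p in trigger_prompts if generate(p) in targets)
    return hits / len(trigger_prompts) >= threshold

# toy stand-in for a suspect model's black-box generate() call
stolen_model = lambda prompt: "fingerprint-token-A"
print(verify_fingerprint(stolen_model,
                         ["trigger-1", "trigger-2"],
                         ["fingerprint-token-A", "fingerprint-token-B"]))
# prints True
```

Accepting any of several candidate targets, rather than one exact string, is what lets this check remain positive even after fine-tuning or other post-injection modifications shift the model's preferred completion.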


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, black_box
Applications
large language model IP protection, ownership verification, unauthorized redistribution detection