Defense · 2025

Blackbox Model Provenance via Palimpsestic Membership Inference

Rohith Kuditipudi, Jing Huang, Sally Zhu, Diyi Yang, Christopher Potts, Percy Liang

2 citations · 42 references · arXiv

Published on arXiv · 2510.19796

Model Theft

OWASP ML Top 10 — ML05

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves p-values of at most 1e-8 when identifying derivative LLMs in all but six of 40+ fine-tuning cases, and can attribute generated text from as few as a few hundred tokens by comparison against checkpoints retrained on reshuffled data.

Palimpsestic Membership Inference

Novel technique introduced


Suppose Alice trains an open-weight language model and Bob uses a blackbox derivative of Alice's model to produce text. Can Alice prove that Bob is using her model, either by querying Bob's derivative model (query setting) or from the text alone (observational setting)? We formulate this question as an independence testing problem, in which the null hypothesis is that Bob's model or text is independent of Alice's randomized training run, and investigate it through the lens of palimpsestic memorization in language models: models are more likely to memorize data seen later in training, so Alice can test whether Bob is using her model via test statistics that capture correlation between Bob's model or text and the ordering of training examples in Alice's training run. If Alice has randomly shuffled her training data, then any significant correlation amounts to exactly quantifiable statistical evidence against the null hypothesis, regardless of the composition of Alice's training data. In the query setting, we directly estimate (via prompting) the likelihood Bob's model gives to Alice's training examples and order; we correlate the likelihoods of over 40 fine-tunes of various Pythia and OLMo base models ranging from 1B to 12B parameters with the base model's training data order, achieving a p-value on the order of at most 1e-8 in all but six cases. In the observational setting, we try two approaches based on estimating 1) the likelihood of Bob's text overlapping with spans of Alice's training examples and 2) the likelihood of Bob's text with respect to different versions of Alice's model we obtain by repeating the last phase (e.g., 1%) of her training run on reshuffled data. The second approach can reliably distinguish Bob's text from as few as a few hundred tokens; the first does not involve any retraining but requires many more tokens (several hundred thousand) to achieve high power.
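The query-setting test described above can be sketched as a simple permutation test. The sketch below is a minimal illustration under stated assumptions: `provenance_p_value` and the toy data are hypothetical, and the paper computes its likelihoods via prompting Bob's model on Alice's real training examples rather than from synthetic scores.

```python
import numpy as np

def provenance_p_value(loglik, n_perm=2000, seed=0):
    """Permutation test for the query setting (hypothetical helper).

    loglik[i] is Bob's model's log-likelihood on Alice's i-th training
    example, listed in Alice's (shuffled) training order. Under the null
    hypothesis that Bob's model is independent of Alice's training run,
    any correlation between loglik and training position is pure chance,
    so the permutation p-value is exactly valid regardless of what data
    Alice trained on.
    """
    loglik = np.asarray(loglik, dtype=float)
    pos = np.arange(len(loglik))
    # Observed statistic: correlation of likelihood with training position
    # (palimpsestic memorization predicts later examples score higher).
    obs = np.corrcoef(loglik, pos)[0, 1]
    rng = np.random.default_rng(seed)
    count = sum(
        np.corrcoef(rng.permutation(loglik), pos)[0, 1] >= obs
        for _ in range(n_perm)
    )
    # Add-one correction keeps the test exactly valid at finite n_perm.
    return (count + 1) / (n_perm + 1)

# Toy check: a derivative model memorizes later examples more strongly,
# so its scores trend upward with position; an unrelated model does not.
rng = np.random.default_rng(1)
p_dep = provenance_p_value(np.linspace(-1.0, 1.0, 200) + rng.normal(0, 0.5, 200))
p_null = provenance_p_value(rng.normal(0, 1.0, 200))
```

Here a small `p_dep` constitutes quantifiable evidence that the scored model depends on Alice's run, while an independent model yields a p-value that is uniform on average.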


Key Contributions

  • Frames LLM model provenance as an independence test exploiting palimpsestic memorization — the tendency for LLMs to memorize training examples seen later more strongly, allowing correlation with training-data order as a provenance signal.
  • Query setting: achieves p-values ≤ 1e-8 for detecting fine-tuned derivatives across 40+ Pythia and OLMo models (1B–12B parameters) by correlating likelihoods with training-data ordering.
  • Observational setting: distinguishes Bob's generated text from as few as a few hundred tokens using reshuffled retraining checkpoints, without requiring any knowledge of Alice's training data composition.
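The reshuffled-checkpoint approach in the observational setting reduces to a rank test over exchangeable models. A minimal sketch, with stand-in "models" and a hypothetical `score_text` interface in place of real language-model likelihood scoring:

```python
import numpy as np

def observational_p_value(score_text, actual_model, reshuffled_models):
    """Rank test for the observational setting (hypothetical interface).

    score_text(model) returns the log-likelihood of Bob's text under a
    model. If Bob's text is independent of Alice's data ordering, the
    actual final model is exchangeable with the k models obtained by
    re-running the last phase of training on reshuffled data, so the
    rank of its score among all k+1 scores is uniform.
    """
    s_actual = score_text(actual_model)
    s_null = np.array([score_text(m) for m in reshuffled_models])
    k = len(s_null)
    # Rank of the actual model's score (1 = highest of the k+1 scores).
    rank = 1 + int(np.sum(s_null >= s_actual))
    return rank / (k + 1)

# Toy demo: each "model" is just a mean log-likelihood, and scoring adds
# small noise. Bob's text fits the actual model far better than any of
# the 99 reshuffled variants, so the actual model ranks first.
rng = np.random.default_rng(0)
score = lambda mu: mu + rng.normal(0, 0.1)  # noisy scoring stand-in
p = observational_p_value(score, actual_model=2.0, reshuffled_models=[0.0] * 99)
```

With 99 reshuffled variants the smallest attainable p-value is 1/100; the paper's query-setting permutation test reaches far smaller p-values because its null distribution is over orderings rather than a fixed pool of retrained checkpoints.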

🛡️ Threat Analysis

Model Theft

The query setting proposes model fingerprinting/provenance verification to statistically prove that Bob's black-box model is derived from Alice's — this is a direct defense against model IP theft through fingerprinting and ownership verification.

Output Integrity Attack

The observational setting detects whether text produced by Bob was generated by a model derived from Alice's training run — this is content provenance and attribution, a core output integrity concern.


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
black_box · inference_time
Datasets
Pythia (1B–12B) · OLMo
Applications
language model IP protection · model provenance verification · LLM-generated text attribution