Defense · 2025

Data Provenance Auditing of Fine-Tuned Large Language Models with a Text-Preserving Technique

Yanming Li 1,2,3, Cédric Eichler 2,4,1, Nicolas Anciaux 1,2,3, Alexandra Bensamoun 3, Lorena Gonzalez Manzano 5, Seifeddine Ghozzi 6



Published on arXiv: 2510.09655

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves reliable detection across multiple open-source LLMs, with no false positives observed and strong detection rates even when marked documents constitute a small fraction of the fine-tuning corpus


We propose a system for marking sensitive or copyrighted texts to detect their use in fine-tuning large language models under black-box access with statistical guarantees. Our method builds digital "marks" using invisible Unicode characters organized into ("cue", "reply") pairs. During an audit, prompts containing only "cue" fragments are issued to trigger regurgitation of the corresponding "reply", indicating document usage. To control false positives, we compare against held-out counterfactual marks and apply a ranking test, yielding a verifiable bound on the false positive rate. The approach is minimally invasive, scalable across many sources, robust to standard processing pipelines, and achieves high detection power even when marked data is a small fraction of the fine-tuning corpus.
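The marking step described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the choice of zero-width characters, the bit encoding, and the insertion positions are all assumptions made here for clarity.

```python
# Hypothetical sketch: embedding an invisible (cue, reply) mark into a document
# using zero-width Unicode characters. The encoding scheme and function names
# are illustrative assumptions, not the paper's actual method.

ZERO_WIDTH = ["\u200b", "\u200c"]  # zero-width space / non-joiner encode bits 0 / 1


def encode_bits(bits: str) -> str:
    """Map a bit string to a sequence of invisible zero-width characters."""
    return "".join(ZERO_WIDTH[int(b)] for b in bits)


def embed_mark(text: str, cue_bits: str, reply_bits: str) -> str:
    """Insert an invisible cue early in the text and its paired reply later."""
    cue, reply = encode_bits(cue_bits), encode_bits(reply_bits)
    midpoint = len(text) // 2
    return text[:midpoint] + cue + text[midpoint:] + reply


marked = embed_mark("The quick brown fox jumps over the lazy dog.", "1010", "0110")
# Visible text is unchanged; the mark survives copy/paste of the raw string.
assert marked.replace("\u200b", "").replace("\u200c", "") == \
    "The quick brown fox jumps over the lazy dog."
```

Because the inserted characters render as zero-width, the marked document is visually identical to the original, which is what makes the technique text-preserving.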


Key Contributions

  • Text-preserving watermarking framework using invisible Unicode characters organized into cue/reply pairs, embedded in training documents prior to fine-tuning to enable post-hoc provenance auditing
  • Statistically grounded ranking test against reserved counterfactual watermarks providing a provable, verifiable bound on the false positive rate under black-box access
  • Empirical evaluation across multiple open-source LLMs and text domains (news, poetry) showing high TPR with no observed false positives, even when marked data is a small fraction of the fine-tuning corpus
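The ranking test in the second contribution can be sketched as a simple rank-based p-value: the auditor scores the model's tendency to regurgitate the true reply and compares it against scores for held-out counterfactual marks that were never embedded. The scoring function and the exact statistic below are assumptions for illustration, not the paper's formulation.

```python
# Hypothetical sketch of the ranking test against counterfactual marks.
# If the true mark outranks all n counterfactuals, the one-sided p-value
# is at most 1/(n+1), giving a verifiable bound on the false positive rate.


def ranking_test_p_value(true_score: float,
                         counterfactual_scores: list[float]) -> float:
    """p-value = (1 + #counterfactuals scoring >= true mark) / (n + 1)."""
    n = len(counterfactual_scores)
    rank = sum(1 for s in counterfactual_scores if s >= true_score)
    return (1 + rank) / (n + 1)


# With 99 counterfactual marks and the true mark ranked top, p <= 0.01.
p = ranking_test_p_value(0.92, [0.10] * 99)
assert p == 0.01
```

The guarantee holds because, under the null hypothesis that the document was never used, the true mark is exchangeable with the counterfactuals, so its rank is uniform over the n + 1 positions.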

🛡️ Threat Analysis

Output Integrity Attack

Watermarks training data using invisible Unicode cue/reply pairs to detect misappropriation in LLM fine-tuning, directly matching the "training data watermarking to detect misappropriation" case explicitly mapped to ML09 in the guidelines. The watermark resides in the data (not the model weights), and the goal is content provenance and attribution, not model IP protection.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, black_box
Datasets
news articles, poetry corpus
Applications
llm fine-tuning provenance auditing, copyright protection for text, training data attribution