Data Provenance Auditing of Fine-Tuned Large Language Models with a Text-Preserving Technique
Yanming Li 1,2,3, Cédric Eichler 2,4,1, Nicolas Anciaux 1,2,3, Alexandra Bensamoun 3, Lorena Gonzalez Manzano 5, Seifeddine Ghozzi 6
Published on arXiv
2510.09655
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Achieves reliable detection across multiple open-source LLMs, with no false positives observed and strong detection rates even when marked documents constitute only a small fraction of the fine-tuning corpus
We propose a system for marking sensitive or copyrighted texts to detect their use in fine-tuning large language models under black-box access with statistical guarantees. Our method builds digital "marks" using invisible Unicode characters organized into ("cue", "reply") pairs. During an audit, prompts containing only "cue" fragments are issued to trigger regurgitation of the corresponding "reply", indicating document usage. To control false positives, we compare against held-out counterfactual marks and apply a ranking test, yielding a verifiable bound on the false positive rate. The approach is minimally invasive, scalable across many sources, robust to standard processing pipelines, and achieves high detection power even when marked data is a small fraction of the fine-tuning corpus.
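The marking idea can be sketched in a few lines. This is a minimal illustration, not the paper's exact encoding: it assumes zero-width characters (ZWSP, ZWNJ) as the invisible alphabet and hypothetical helper names (`encode_bits`, `mark_document`) for embedding a cue/reply pair into a document.

```python
# Hypothetical sketch of invisible-Unicode marking (not the authors' exact scheme).
# Zero-width characters render invisibly yet survive copy/paste and most
# text-processing pipelines.
ZW0 = "\u200b"  # ZERO WIDTH SPACE      -> bit 0
ZW1 = "\u200c"  # ZERO WIDTH NON-JOINER -> bit 1

def encode_bits(bits: str) -> str:
    """Map a bit string to an invisible zero-width character sequence."""
    return "".join(ZW1 if b == "1" else ZW0 for b in bits)

def mark_document(text: str, cue_bits: str, reply_bits: str) -> str:
    """Embed the cue after the first sentence and the reply after the second,
    so a model fine-tuned on this text may learn the (cue -> reply) pattern."""
    sentences = text.split(". ")
    sentences[0] += encode_bits(cue_bits)
    if len(sentences) > 1:
        sentences[1] += encode_bits(reply_bits)
    return ". ".join(sentences)

doc = "The quick brown fox jumps. It lands softly. The end."
marked = mark_document(doc, cue_bits="1010", reply_bits="0110")
# The visible text is unchanged: stripping zero-width chars recovers the original.
print(marked.replace(ZW0, "").replace(ZW1, "") == doc)
```

During an audit, a prompt containing only the cue fragment would then be issued to the suspect model, and the response checked for the corresponding reply sequence.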
Key Contributions
- Text-preserving watermarking framework using invisible Unicode characters organized into cue/reply pairs, embedded in training documents prior to fine-tuning to enable post-hoc provenance auditing
- Statistically grounded ranking test against reserved counterfactual watermarks providing a provable, verifiable bound on the false positive rate under black-box access
- Empirical evaluation across multiple open-source LLMs and text domains (news, poetry) showing high TPR with no observed false positives, even when marked data is a small fraction of the fine-tuning corpus
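The ranking test behind the false-positive bound can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's implementation: each mark is assumed to yield a scalar detection score, and the scores in the example are toy numbers.

```python
# Hedged sketch of the ranking test: the embedded mark's detection score is
# ranked against N held-out counterfactual marks that were never embedded.
# Under the null hypothesis (no memorization), all N+1 scores are exchangeable,
# so P(true mark ranks in the top k) <= k / (N + 1) -- a verifiable FPR bound.
def rank_test_pvalue(true_score: float, counterfactual_scores: list) -> float:
    """Return the rank-based p-value for the embedded mark."""
    n = len(counterfactual_scores)
    # Rank = 1 + number of counterfactuals scoring at least as high.
    rank = 1 + sum(s >= true_score for s in counterfactual_scores)
    return rank / (n + 1)

# Toy detection scores (illustrative only): the true mark is regurgitated
# far more often than any counterfactual mark.
true_score = 0.92
counterfactuals = [0.10, 0.05, 0.22, 0.08, 0.14, 0.03, 0.19, 0.11, 0.07]
p = rank_test_pvalue(true_score, counterfactuals)
print(p)  # 0.1 -- the smallest p-value attainable with 9 counterfactuals
```

Holding out more counterfactual marks tightens the achievable bound: with N counterfactuals, the smallest reportable p-value is 1/(N+1), so the auditor controls the false-positive guarantee by choosing N.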
🛡️ Threat Analysis
Watermarks TRAINING DATA using invisible Unicode cue/reply pairs to detect misappropriation in LLM fine-tuning — directly matches the 'training data watermarking to detect misappropriation' case explicitly mapped to ML09 in the guidelines. The watermark is in the data (not model weights), and the goal is content provenance/attribution, not model IP protection.