Data Provenance Auditing of Fine-Tuned Large Language Models with a Text-Preserving Technique
Yanming Li 1,2,3, Cédric Eichler 2,4,1, Nicolas Anciaux 1,2,3, Alexandra Bensamoun 3, Lorena Gonzalez Manzano 5, Seifeddine Ghozzi 6
Published on arXiv
2510.09655
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Achieves reliable detection across multiple open-source LLMs, with no false positives observed and strong detection rates even when marked documents constitute only a small fraction of the fine-tuning corpus
We propose a system for marking sensitive or copyrighted texts to detect their use in fine-tuning large language models under black-box access with statistical guarantees. Our method builds digital "marks" using invisible Unicode characters organized into ("cue", "reply") pairs. During an audit, prompts containing only "cue" fragments are issued to trigger regurgitation of the corresponding "reply", indicating document usage. To control false positives, we compare against held-out counterfactual marks and apply a ranking test, yielding a verifiable bound on the false positive rate. The approach is minimally invasive, scalable across many sources, robust to standard processing pipelines, and achieves high detection power even when marked data is a small fraction of the fine-tuning corpus.
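The marking idea can be sketched in a few lines. This is a minimal illustration, not the paper's exact encoding: it assumes zero-width characters (ZWSP, ZWNJ) as the invisible alphabet and hypothetical helper names (`encode_bits`, `mark_document`) for embedding a cue/reply pair into a document.

```python
# Hypothetical sketch of invisible-Unicode marking (not the authors' exact scheme).
# Zero-width characters render invisibly yet survive copy/paste and most
# text-processing pipelines.
ZW0 = "\u200b"  # ZERO WIDTH SPACE      -> bit 0
ZW1 = "\u200c"  # ZERO WIDTH NON-JOINER -> bit 1

def encode_bits(bits: str) -> str:
    """Map a bit string to an invisible zero-width character sequence."""
    return "".join(ZW1 if b == "1" else ZW0 for b in bits)

def mark_document(text: str, cue_bits: str, reply_bits: str) -> str:
    """Embed the cue after the first sentence and the reply after the second,
    so a model fine-tuned on this text may learn the (cue -> reply) pattern."""
    sentences = text.split(". ")
    sentences[0] += encode_bits(cue_bits)
    if len(sentences) > 1:
        sentences[1] += encode_bits(reply_bits)
    return ". ".join(sentences)

doc = "The quick brown fox jumps. It lands softly. The end."
marked = mark_document(doc, cue_bits="1010", reply_bits="0110")
# The visible text is unchanged: stripping zero-width chars recovers the original.
print(marked.replace(ZW0, "").replace(ZW1, "") == doc)
```

During an audit, a prompt containing only the cue fragment would then be issued to the suspect model, and the response checked for the corresponding reply sequence.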
Key Contributions
- Text-preserving watermarking framework using invisible Unicode characters organized into cue/reply pairs, embedded in training documents prior to fine-tuning to enable post-hoc provenance auditing
- Statistically grounded ranking test against reserved counterfactual watermarks providing a provable, verifiable bound on the false positive rate under black-box access
- Empirical evaluation across multiple open-source LLMs and text domains (news, poetry) showing high TPR with no observed false positives, even when marked data is a small fraction of the fine-tuning corpus
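The ranking test behind the false-positive bound can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's implementation: each mark is assumed to yield a scalar detection score, and the scores in the example are toy numbers.

```python
# Hedged sketch of the ranking test: the embedded mark's detection score is
# ranked against N held-out counterfactual marks that were never embedded.
# Under the null hypothesis (no memorization), all N+1 scores are exchangeable,
# so P(true mark ranks in the top k) <= k / (N + 1) -- a verifiable FPR bound.
def rank_test_pvalue(true_score: float, counterfactual_scores: list) -> float:
    """Return the rank-based p-value for the embedded mark."""
    n = len(counterfactual_scores)
    # Rank = 1 + number of counterfactuals scoring at least as high.
    rank = 1 + sum(s >= true_score for s in counterfactual_scores)
    return rank / (n + 1)

# Toy detection scores (illustrative only): the true mark is regurgitated
# far more often than any counterfactual mark.
true_score = 0.92
counterfactuals = [0.10, 0.05, 0.22, 0.08, 0.14, 0.03, 0.19, 0.11, 0.07]
p = rank_test_pvalue(true_score, counterfactuals)
print(p)  # 0.1 -- the smallest p-value attainable with 9 counterfactuals
```

Holding out more counterfactual marks tightens the achievable bound: with N counterfactuals, the smallest reportable p-value is 1/(N+1), so the auditor controls the false-positive guarantee by choosing N.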
🛡️ Threat Analysis
Watermarks TRAINING DATA using invisible Unicode cue/reply pairs to detect misappropriation in LLM fine-tuning — directly matches the 'training data watermarking to detect misappropriation' case explicitly mapped to ML09 in the guidelines. The watermark is in the data (not model weights), and the goal is content provenance/attribution, not model IP protection.