defense · arXiv · Oct 7, 2025
Yanming Li, Cédric Eichler, Nicolas Anciaux et al. · INRIA · INSA CVL · Université Paris-Saclay +3 more
Embeds invisible Unicode watermarks in training documents to audit whether copyrighted text was used in LLM fine-tuning under black-box access
Output Integrity Attack · nlp
We propose a system for marking sensitive or copyrighted texts to detect their use in fine-tuning large language models under black-box access, with statistical guarantees. Our method builds digital "marks" from invisible Unicode characters organized into ("cue", "reply") pairs. During an audit, prompts containing only the "cue" fragments are issued to the model; regurgitation of the corresponding "reply" indicates that the document was used. To control false positives, we compare against held-out counterfactual marks and apply a ranking test, yielding a verifiable bound on the false positive rate. The approach is minimally invasive, scalable across many sources, robust to standard processing pipelines, and achieves high detection power even when marked data is a small fraction of the fine-tuning corpus.
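The mark-embedding idea can be sketched in a few lines. This is a hypothetical illustration, not the paper's exact scheme: it assumes mark bits are rendered with two zero-width Unicode characters and that a (cue, reply) pair is inserted at a sentence boundary; the function names and character choices are ours.

```python
# Hypothetical sketch: embedding an invisible Unicode "mark" as a
# (cue, reply) pair. Character choices and placement are illustrative.

ZW0 = "\u200b"  # ZERO WIDTH SPACE      -> encodes bit 0
ZW1 = "\u200c"  # ZERO WIDTH NON-JOINER -> encodes bit 1

def encode_bits(bits: str) -> str:
    """Render a bit string as an invisible zero-width character sequence."""
    return "".join(ZW1 if b == "1" else ZW0 for b in bits)

def embed_mark(text: str, cue_bits: str, reply_bits: str) -> str:
    """Insert an invisible (cue, reply) pair after the first sentence."""
    head, sep, tail = text.partition(". ")
    return head + sep + encode_bits(cue_bits) + encode_bits(reply_bits) + tail

plain = "A public sentence. More prose follows."
marked = embed_mark(plain, "1010", "0110")
print(marked == plain)            # False: the mark is present
print(len(marked) - len(plain))   # 8 invisible characters added
```

The marked text renders identically to the original in most viewers, which is what makes the scheme minimally invasive while remaining machine-detectable.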
llm · transformer