
Perturb Your Data: Paraphrase-Guided Training Data Watermarking

Pranav Shetty, Mirazul Haque, Petr Babkin, Zhiqiang Ma, Xiaomo Liu, Manuela Veloso


Published on arXiv (2512.17075)

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

SPECTRA achieves a p-value gap of over nine orders of magnitude between watermarked training data and non-member data, outperforming all baselines across multiple LLM training scenarios.

SPECTRA

Novel technique introduced


Training data detection is critical for enforcing copyright and data licensing, as Large Language Models (LLMs) are trained on massive text corpora scraped from the internet. We present SPECTRA, a watermarking approach that makes training data reliably detectable even when it comprises less than 0.001% of the training corpus. SPECTRA works by paraphrasing text with an LLM and assigning each paraphrase a score based on how likely it is according to a separate scoring model. A paraphrase is chosen so that its score closely matches that of the original text, avoiding distribution shift. To test whether a suspect model has been trained on the watermarked data, we compare its token probabilities against those of the scoring model. We demonstrate that SPECTRA achieves a consistent p-value gap of over nine orders of magnitude between data used for training and data not used for training, a larger gap than any baseline tested. SPECTRA equips data owners with a scalable, deploy-before-release watermark that survives even large-scale LLM training.
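The selection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names (`sequence_score`, `select_watermark_paraphrase`) and the length-normalized log-likelihood score are assumptions standing in for whatever scoring rule SPECTRA actually uses.

```python
def sequence_score(model_logprob, text):
    # Length-normalized log-likelihood under the scoring model.
    # `model_logprob(text)` is assumed to return the total token
    # log-probability of `text` (hypothetical interface).
    return model_logprob(text) / max(len(text.split()), 1)

def select_watermark_paraphrase(original, paraphrases, model_logprob):
    # Choose the paraphrase whose score is closest to the original's,
    # so the released text stays close to the original distribution.
    target = sequence_score(model_logprob, original)
    return min(
        paraphrases,
        key=lambda p: abs(sequence_score(model_logprob, p) - target),
    )
```

Matching scores rather than maximizing them is the key design choice: a maximally likely paraphrase would shift the data distribution, while a score-matched one preserves it and keeps the watermark harder to filter out.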


Key Contributions

  • SPECTRA: a deploy-before-release training data watermarking method using LLM paraphrasing with score-matched selection to avoid distribution shift
  • Detection mechanism comparing suspect model token probabilities against a scoring model, achieving p-value gaps of over nine orders of magnitude between member and non-member data
  • Demonstrated effectiveness even when watermarked data comprises less than 0.001% of the training corpus, outperforming STAMP and LLM-DI baselines
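The detection mechanism in the second bullet can be illustrated with a toy statistic. The paper's exact test is not specified in this summary; the sketch below assumes a simple paired comparison of per-token log-probabilities, with a one-sided z-test standing in for whatever statistic SPECTRA uses. The function name and interface are hypothetical.

```python
import math
import statistics

def membership_statistic(suspect_logprobs, scoring_logprobs):
    # Per-token log-prob differences between the suspect model and the
    # scoring model: a model trained on the watermarked data should
    # assign systematically higher probability to the chosen paraphrases.
    diffs = [s - r for s, r in zip(suspect_logprobs, scoring_logprobs)]
    mean = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)
    z = mean / (sd / math.sqrt(len(diffs)))
    # One-sided p-value from the normal survival function.
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p
```

On member data the differences skew positive and the p-value collapses; on non-member data they center near zero, which is the gap the reported nine-orders-of-magnitude result measures.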

🛡️ Threat Analysis

Output Integrity Attack

SPECTRA watermarks the training data itself (not model weights) by selecting score-matched paraphrases, then detects whether that data was used for LLM training via token-probability comparison. This is training data watermarking for content provenance and copyright enforcement, squarely within ML09.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, black_box
Datasets
The Pile, PeS2o, Paloma, Common Pile
Applications
training data copyright detection, data licensing enforcement, LLM training auditing