Defense · 2026

SLIM: Stealthy Low-Coverage Black-Box Watermarking via Latent-Space Confusion Zones

Hengyu Wu, Yang Cao

0 citations · 44 references · arXiv


Published on arXiv · 2601.03242

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

SLIM achieves reliable watermark verification under ultra-low coverage (one or a few modified sequences) with strong black-box detection performance while remaining stealthy and preserving model utility.

SLIM (Stealthy Low-coverage Instability waterMarking)

Novel technique introduced


Training data is a critical and often proprietary asset in Large Language Model (LLM) development, motivating the use of data watermarking to embed model-transferable signals for usage verification. We identify low coverage as a vital yet largely overlooked requirement for practicality, as individual data owners typically contribute only a minute fraction of massive training corpora. Prior methods fail to maintain stealthiness, verification feasibility, or robustness when only one or a few sequences can be modified. To address these limitations, we introduce SLIM, a framework enabling per-user data provenance verification under strict black-box access. SLIM leverages intrinsic LLM properties to induce a Latent-Space Confusion Zone by training the model to map semantically similar prefixes to divergent continuations. This manifests as localized generation instability, which can be reliably detected via hypothesis testing. Experiments demonstrate that SLIM achieves ultra-low coverage capability, strong black-box verification performance, and great scalability while preserving both stealthiness and model utility, offering a robust solution for protecting training data in modern LLM pipelines.
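The confusion-zone construction described in the abstract can be illustrated with a minimal sketch: several semantically near-identical prefix variants are each paired with a deliberately different continuation, so that a model trained on the records maps one tight latent neighborhood to divergent outputs. All function names, example strings, and the record format below are hypothetical illustrations, not SLIM's actual data-construction procedure.

```python
def build_confusion_zone(prefix_variants, continuations):
    """Pair each near-identical prefix variant with a *different*
    continuation, so a model trained on these records maps one tight
    latent neighborhood to divergent outputs (the 'confusion zone')."""
    if len(set(continuations)) < len(prefix_variants):
        raise ValueError("continuations must be mutually divergent")
    return [{"prompt": p, "completion": c}
            for p, c in zip(prefix_variants, continuations)]

# Hypothetical illustration: three paraphrase-level prefix variants,
# each forced onto a different continuation.
variants = [
    "The committee approved the proposal on",
    "The committee ratified the proposal on",
    "The committee endorsed the proposal on",
]
divergent = [
    "Tuesday morning.",
    "a strictly provisional basis.",
    "the chair's casting vote.",
]
watermark_records = build_confusion_zone(variants, divergent)
```

Because only these few records need to be injected, the construction is compatible with the ultra-low coverage setting the paper targets: a single data owner contributes one small neighborhood of modified sequences.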


Key Contributions

  • First framework to explicitly address ultra-low coverage as a core requirement for practical training data watermarking, enabling per-user provenance verification with as few as a single modified sequence
  • Latent-Space Confusion Zone induction technique that forces an LLM to map semantically similar prefixes to divergent continuations, producing localized generation instability as a detectable watermark signal
  • Hypothesis testing-based black-box verification protocol requiring no model weights, reference models, or white-box access — compatible with commercial API settings
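The verification protocol in the last bullet can be sketched as a one-sided hypothesis test over black-box queries: sample continuations for the watermarked prefix neighborhood, score how unstable each neighborhood's outputs are, and test whether unstable neighborhoods occur more often than a null rate expected from a model that never saw the watermark. The instability metric, null rate, and threshold below are assumptions for illustration, not SLIM's exact test statistic.

```python
import math

def instability(continuations):
    """Fraction of pairwise-distinct continuations sampled for one
    prefix neighborhood (1.0 = maximally unstable)."""
    pairs = [(a, b) for i, a in enumerate(continuations)
             for b in continuations[i + 1:]]
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

def binom_tail(k, n, p):
    """Exact one-sided tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

def verify(unstable_flags, null_rate=0.1, alpha=0.05):
    """One-sided test: are unstable neighborhoods significantly more
    common than the assumed null rate for an unwatermarked model?"""
    n, k = len(unstable_flags), sum(unstable_flags)
    p_value = binom_tail(k, n, null_rate)
    return p_value < alpha, p_value
```

A typical call would flag each queried neighborhood via `instability(samples) > 0.5` and pass the flags to `verify`; no weights, logits, or reference model are needed, only repeated sampling through the API.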

🛡️ Threat Analysis

Output Integrity Attack

SLIM watermarks TRAINING DATA (not model weights) to detect misappropriation — the watermark embeds model-transferable signals that allow data owners to verify if their data was used to train an LLM, fitting the guideline: 'Watermarking TRAINING DATA to detect misappropriation → ML09'. The verification goal is training data provenance and content authenticity, not model IP.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, training_time
Applications
llm training data protection, data provenance verification, ip protection for training corpora