Defense · 2026

SLIM: Stealthy Low-Coverage Black-Box Watermarking via Latent-Space Confusion Zones

Hengyu Wu, Yang Cao

0 citations · 44 references · arXiv


Published on arXiv · 2601.03242

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

SLIM achieves reliable watermark verification under ultra-low coverage (one or a few modified sequences) with strong black-box detection performance while remaining stealthy and preserving model utility.

SLIM (Stealthy Low-coverage Instability waterMarking)

Novel technique introduced


Training data is a critical and often proprietary asset in Large Language Model (LLM) development, motivating the use of data watermarking to embed model-transferable signals for usage verification. We identify low coverage as a vital yet largely overlooked requirement for practicality, as individual data owners typically contribute only a minute fraction of massive training corpora. Prior methods fail to maintain stealthiness, verification feasibility, or robustness when only one or a few sequences can be modified. To address these limitations, we introduce SLIM, a framework enabling per-user data provenance verification under strict black-box access. SLIM leverages intrinsic LLM properties to induce a Latent-Space Confusion Zone by training the model to map semantically similar prefixes to divergent continuations. This manifests as localized generation instability, which can be reliably detected via hypothesis testing. Experiments demonstrate that SLIM achieves ultra-low coverage capability, strong black-box verification performance, and great scalability while preserving both stealthiness and model utility, offering a robust solution for protecting training data in modern LLM pipelines.
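The confusion-zone construction described in the abstract can be illustrated with a minimal sketch: several semantically near-identical prefix variants are each paired with a deliberately different continuation, so that a model trained on the records maps one tight latent neighborhood to divergent outputs. All function names, example strings, and the record format below are hypothetical illustrations, not SLIM's actual data-construction procedure.

```python
def build_confusion_zone(prefix_variants, continuations):
    """Pair each near-identical prefix variant with a *different*
    continuation, so a model trained on these records maps one tight
    latent neighborhood to divergent outputs (the 'confusion zone')."""
    if len(set(continuations)) < len(prefix_variants):
        raise ValueError("continuations must be mutually divergent")
    return [{"prompt": p, "completion": c}
            for p, c in zip(prefix_variants, continuations)]

# Hypothetical illustration: three paraphrase-level prefix variants,
# each forced onto a different continuation.
variants = [
    "The committee approved the proposal on",
    "The committee ratified the proposal on",
    "The committee endorsed the proposal on",
]
divergent = [
    "Tuesday morning.",
    "a strictly provisional basis.",
    "the chair's casting vote.",
]
watermark_records = build_confusion_zone(variants, divergent)
```

Because only these few records need to be injected, the construction is compatible with the ultra-low coverage setting the paper targets: a single data owner contributes one small neighborhood of modified sequences.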


Key Contributions

  • First framework to explicitly address ultra-low coverage as a core requirement for practical training data watermarking, enabling per-user provenance verification with as few as a single modified sequence
  • Latent-Space Confusion Zone induction technique that forces an LLM to map semantically similar prefixes to divergent continuations, producing localized generation instability as a detectable watermark signal
  • Hypothesis testing-based black-box verification protocol requiring no model weights, reference models, or white-box access — compatible with commercial API settings
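The verification protocol in the last bullet can be sketched as a one-sided hypothesis test over black-box queries: sample continuations for the watermarked prefix neighborhood, score how unstable each neighborhood's outputs are, and test whether unstable neighborhoods occur more often than a null rate expected from a model that never saw the watermark. The instability metric, null rate, and threshold below are assumptions for illustration, not SLIM's exact test statistic.

```python
import math

def instability(continuations):
    """Fraction of pairwise-distinct continuations sampled for one
    prefix neighborhood (1.0 = maximally unstable)."""
    pairs = [(a, b) for i, a in enumerate(continuations)
             for b in continuations[i + 1:]]
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

def binom_tail(k, n, p):
    """Exact one-sided tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

def verify(unstable_flags, null_rate=0.1, alpha=0.05):
    """One-sided test: are unstable neighborhoods significantly more
    common than the assumed null rate for an unwatermarked model?"""
    n, k = len(unstable_flags), sum(unstable_flags)
    p_value = binom_tail(k, n, null_rate)
    return p_value < alpha, p_value
```

A typical call would flag each queried neighborhood via `instability(samples) > 0.5` and pass the flags to `verify`; no weights, logits, or reference model are needed, only repeated sampling through the API.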

🛡️ Threat Analysis

Output Integrity Attack

SLIM watermarks TRAINING DATA (not model weights) to detect misappropriation — the watermark embeds model-transferable signals that allow data owners to verify if their data was used to train an LLM, fitting the guideline: 'Watermarking TRAINING DATA to detect misappropriation → ML09'. The verification goal is training data provenance and content authenticity, not model IP.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, training_time
Applications
llm training data protection, data provenance verification, ip protection for training corpora