Defense · 2025

SEAL: Subspace-Anchored Watermarks for LLM Ownership

Yanbo Dai, Zongjie Li, Zhenlan Ji, Shuai Wang

0 citations · arXiv


Published on arXiv · 2511.11356

Model Theft · OWASP ML Top 10 — ML05

Model Theft · OWASP LLM Top 10 — LLM10

Key Finding

SEAL maintains strong ownership verification performance even when adversaries possess full knowledge of the watermarking mechanism and embedded signatures, outperforming 11 existing fingerprinting and watermarking methods in effectiveness, fidelity, efficiency, and robustness.

SEAL

Novel technique introduced


Large language models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks, demonstrating human-level performance in text generation, reasoning, and question answering. However, training such models requires substantial computational resources, large curated datasets, and sophisticated alignment procedures. As a result, they constitute highly valuable intellectual property (IP) assets that warrant robust protection mechanisms. Existing IP protection approaches suffer from critical limitations. Model fingerprinting techniques can identify model architectures but fail to establish ownership of specific model instances. In contrast, traditional backdoor-based watermarking methods embed behavioral anomalies that can be easily removed through common post-processing operations such as fine-tuning or knowledge distillation. We propose SEAL, a subspace-anchored watermarking framework that embeds multi-bit signatures directly into the model's latent representational space, supporting both white-box and black-box verification scenarios. Our approach leverages model editing techniques to align the hidden representations of selected anchor samples with predefined orthogonal bit vectors. This alignment embeds the watermark while preserving the model's original factual predictions, rendering the watermark functionally harmless and stealthy. We conduct comprehensive experiments on multiple benchmark datasets and six prominent LLMs, comparing SEAL with 11 existing fingerprinting and watermarking methods to demonstrate its superior effectiveness, fidelity, efficiency, and robustness. Furthermore, we evaluate SEAL under potential knowledgeable attacks and show that it maintains strong verification performance even when adversaries possess knowledge of the watermarking mechanism and the embedded signatures.


Key Contributions

  • SEAL watermarking framework that aligns hidden representations of anchor samples with predefined orthogonal bit vectors, embedding multi-bit ownership signatures in LLM latent space
  • Supports both white-box and black-box verification, making the watermark functionally stealthy while preserving factual model predictions
  • Outperforms 11 competing fingerprinting/watermarking baselines across six LLMs, and remains robust against knowledgeable adversaries who know both the mechanism and the embedded signatures

🛡️ Threat Analysis

Model Theft

SEAL embeds multi-bit signatures directly into the model's latent representational space (model weights/hidden states) to prove ownership of a stolen LLM — classic model watermarking as a defense against model theft. Supports both white-box and black-box verification and is evaluated against adversaries attempting to remove or defeat the watermark.
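The robustness claim — that the watermark survives adversarial post-processing — can be probed with a toy simulation: perturb an aligned hidden representation with noise (standing in for the drift caused by fine-tuning or distillation) and check how many signature bits still decode correctly. The noise model and parameters below are illustrative assumptions, not the paper's attack setup.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_bits = 64, 32

# Orthonormal watermark subspace and signed signature (as in a
# hypothetical SEAL-style embedding; construction assumed, not from the paper).
Q, _ = np.linalg.qr(rng.standard_normal((d, n_bits)))
B = Q.T                                   # (n_bits, d), orthonormal rows
signs = rng.choice([-1.0, 1.0], n_bits)
h = signs @ B                             # ideally aligned hidden state

# Model post-processing (fine-tuning, distillation) is crudely modeled
# as additive Gaussian noise on the hidden representation.
noisy = h + 0.3 * rng.standard_normal(d)
recovered = np.sign(B @ noisy)
match_rate = (recovered == signs).mean()  # fraction of bits still correct
```

Because each bit is read from the sign of a projection with magnitude 1, a perturbation must move that projection past zero to flip the bit, which gives the subspace encoding a built-in noise margin.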


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, black_box, training_time
Datasets
multiple benchmark datasets (unspecified in excerpt)
Applications
llm ip protection, model ownership verification