On Protecting Agentic Systems' Intellectual Property via Watermarking
Liwen Wang, Zongjie Li, Yuchong Xie, Shuai Wang, Dongdong She, Wei Wang, Juergen Rahmel
Published on arXiv (2602.08401)
OWASP ML Top 10 — ML05: Model Theft
OWASP LLM Top 10 — LLM10: Model Theft
Key Finding
AGENTWM achieves high watermark detection accuracy with negligible impact on agent performance across three complex domains, and adaptive adversaries cannot remove the watermarks without severely degrading the stolen model's utility.
Novel technique introduced: AGENTWM
The evolution of Large Language Models (LLMs) into agentic systems that perform autonomous reasoning and tool use has created significant intellectual property (IP) value. We demonstrate that these systems are highly vulnerable to imitation attacks, where adversaries steal proprietary capabilities by training imitation models on victim outputs. Crucially, existing LLM watermarking techniques fail in this domain because real-world agentic systems often operate as grey boxes, concealing the internal reasoning traces required for verification. This paper presents AGENTWM, the first watermarking framework designed specifically for agentic models. AGENTWM exploits the semantic equivalence of action sequences, injecting watermarks by subtly biasing the distribution of functionally identical tool execution paths. This mechanism allows AGENTWM to embed verifiable signals directly into the visible action trajectory while remaining indistinguishable to users. We develop an automated pipeline to generate robust watermark schemes and a rigorous statistical hypothesis testing procedure for verification. Extensive evaluations across three complex domains demonstrate that AGENTWM achieves high detection accuracy with negligible impact on agent performance. Our results confirm that AGENTWM effectively protects agentic IP against adaptive adversaries, who cannot remove the watermarks without severely degrading the stolen model's utility.
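The core embedding idea, biasing the choice among functionally equivalent tool-execution paths with a secret key, can be illustrated with a minimal sketch. This is not the paper's implementation; the key, the `biased_choice` helper, and the bias parameter are all illustrative assumptions.

```python
import hashlib
import hmac
import random

SECRET_KEY = b"owner-secret"  # hypothetical watermark key held by the model owner


def biased_choice(context: str, equivalent_paths: list[str], bias: float = 0.8) -> str:
    """Pick one of several functionally equivalent tool-execution paths.

    A key-dependent "green" path is derived deterministically from the task
    context; the agent emits it with probability `bias`, otherwise it samples
    uniformly. Over many actions this skews the visible trajectory
    distribution in a way only the key holder can later test for.
    """
    # Derive a deterministic index from the secret key and the task context.
    digest = hmac.new(SECRET_KEY, context.encode(), hashlib.sha256).digest()
    green = int.from_bytes(digest[:4], "big") % len(equivalent_paths)
    if random.random() < bias:
        return equivalent_paths[green]
    return random.choice(equivalent_paths)
```

Because each path in `equivalent_paths` is semantically equivalent, the bias does not change what the agent accomplishes, only which of several interchangeable action sequences it prefers.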
Key Contributions
- First watermarking framework specifically designed for agentic LLM systems, addressing the grey-box verification problem where internal reasoning traces are hidden
- Novel mechanism that exploits semantic equivalence of action sequences to embed verifiable watermarks in visible tool execution paths without degrading utility
- Automated pipeline for generating robust watermark schemes with rigorous statistical hypothesis testing, shown effective against adaptive adversaries who cannot remove watermarks without severe performance degradation
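The statistical verification step can be sketched as a one-sided binomial test: if a suspect model picks the key-designated path far more often than an unwatermarked model choosing uniformly among k equivalents would, the null hypothesis of independence is rejected. The function names and the significance threshold below are illustrative, not taken from the paper.

```python
from math import comb


def binomial_pvalue(n: int, hits: int, p0: float) -> float:
    """One-sided p-value: P(X >= hits) for X ~ Binomial(n, p0)."""
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(hits, n + 1))


def is_watermarked(n_queries: int, green_hits: int, n_equiv: int,
                   alpha: float = 1e-3) -> bool:
    """Reject the null 'paths are chosen uniformly among n_equiv options'
    when the count of key-designated ("green") paths is improbably high."""
    return binomial_pvalue(n_queries, green_hits, 1.0 / n_equiv) < alpha
```

For example, with two equivalent paths per decision (null rate 0.5), observing the keyed path in 80 of 100 probe queries is overwhelming evidence of the watermark, while 52 of 100 is consistent with chance.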
🛡️ Threat Analysis
AGENTWM is designed to prove ownership of stolen agentic models — the watermark is injected into the agent's behavioral output distribution (action trajectories) specifically so it transfers to imitation models trained on stolen outputs, enabling detection of model IP theft. This is model ownership watermarking, not content provenance.