On Protecting Agentic Systems' Intellectual Property via Watermarking
Liwen Wang, Zongjie Li, Yuchong Xie, Shuai Wang, Dongdong She, Wei Wang, Juergen Rahmel
Published on arXiv (2602.08401)
OWASP ML Top 10 — ML05: Model Theft
OWASP LLM Top 10 — LLM10: Model Theft
Key Finding
AGENTWM achieves high watermark detection accuracy with negligible impact on agent performance across three complex domains, and adaptive adversaries cannot remove the watermarks without severely degrading the stolen model's utility.
Novel technique introduced: AGENTWM
The evolution of Large Language Models (LLMs) into agentic systems that perform autonomous reasoning and tool use has created significant intellectual property (IP) value. We demonstrate that these systems are highly vulnerable to imitation attacks, where adversaries steal proprietary capabilities by training imitation models on victim outputs. Crucially, existing LLM watermarking techniques fail in this domain because real-world agentic systems often operate as grey boxes, concealing the internal reasoning traces required for verification. This paper presents AGENTWM, the first watermarking framework designed specifically for agentic models. AGENTWM exploits the semantic equivalence of action sequences, injecting watermarks by subtly biasing the distribution of functionally identical tool execution paths. This mechanism allows AGENTWM to embed verifiable signals directly into the visible action trajectory while remaining indistinguishable to users. We develop an automated pipeline to generate robust watermark schemes and a rigorous statistical hypothesis testing procedure for verification. Extensive evaluations across three complex domains demonstrate that AGENTWM achieves high detection accuracy with negligible impact on agent performance. Our results confirm that AGENTWM effectively protects agentic IP against adaptive adversaries, who cannot remove the watermarks without severely degrading the stolen model's utility.
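The core embedding idea, biasing the choice among functionally equivalent tool-execution paths with a secret key, can be illustrated with a minimal sketch. This is not the paper's implementation; the key, the `biased_choice` helper, and the bias parameter are all illustrative assumptions.

```python
import hashlib
import hmac
import random

SECRET_KEY = b"owner-secret"  # hypothetical watermark key held by the model owner


def biased_choice(context: str, equivalent_paths: list[str], bias: float = 0.8) -> str:
    """Pick one of several functionally equivalent tool-execution paths.

    A key-dependent "green" path is derived deterministically from the task
    context; the agent emits it with probability `bias`, otherwise it samples
    uniformly. Over many actions this skews the visible trajectory
    distribution in a way only the key holder can later test for.
    """
    # Derive a deterministic index from the secret key and the task context.
    digest = hmac.new(SECRET_KEY, context.encode(), hashlib.sha256).digest()
    green = int.from_bytes(digest[:4], "big") % len(equivalent_paths)
    if random.random() < bias:
        return equivalent_paths[green]
    return random.choice(equivalent_paths)
```

Because each path in `equivalent_paths` is semantically equivalent, the bias does not change what the agent accomplishes, only which of several interchangeable action sequences it prefers.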
Key Contributions
- First watermarking framework specifically designed for agentic LLM systems, addressing the grey-box verification problem where internal reasoning traces are hidden
- Novel mechanism that exploits semantic equivalence of action sequences to embed verifiable watermarks in visible tool execution paths without degrading utility
- Automated pipeline for generating robust watermark schemes with rigorous statistical hypothesis testing, shown effective against adaptive adversaries who cannot remove watermarks without severe performance degradation
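The statistical verification step can be sketched as a one-sided binomial test: if a suspect model picks the key-designated path far more often than an unwatermarked model choosing uniformly among k equivalents would, the null hypothesis of independence is rejected. The function names and the significance threshold below are illustrative, not taken from the paper.

```python
from math import comb


def binomial_pvalue(n: int, hits: int, p0: float) -> float:
    """One-sided p-value: P(X >= hits) for X ~ Binomial(n, p0)."""
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(hits, n + 1))


def is_watermarked(n_queries: int, green_hits: int, n_equiv: int,
                   alpha: float = 1e-3) -> bool:
    """Reject the null 'paths are chosen uniformly among n_equiv options'
    when the count of key-designated ("green") paths is improbably high."""
    return binomial_pvalue(n_queries, green_hits, 1.0 / n_equiv) < alpha
```

For example, with two equivalent paths per decision (null rate 0.5), observing the keyed path in 80 of 100 probe queries is overwhelming evidence of the watermark, while 52 of 100 is consistent with chance.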
🛡️ Threat Analysis
AGENTWM is designed to prove ownership of stolen agentic models — the watermark is injected into the agent's behavioral output distribution (action trajectories) specifically so it transfers to imitation models trained on stolen outputs, enabling detection of model IP theft. This is model ownership watermarking, not content provenance.