
AgentMark: Utility-Preserving Behavioral Watermarking for Agents

Kaibo Huang 1, Jin Tan 1, Yukun Wei 1, Wanling Li 1, Zipei Zhang 1, Hui Tian 2, Zhongliang Yang 1, Linna Zhou 1

0 citations · 44 references · Published on arXiv: 2601.03294

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

AgentMark achieves practical multi-bit watermark capacity with robust recovery from partial trajectories and no measurable degradation in agent task performance across embodied, tool-use, and social environments.

AgentMark

Novel technique introduced


LLM-based agents are increasingly deployed to autonomously solve complex tasks, raising an urgent need for IP protection and regulatory provenance. While content watermarking effectively attributes LLM-generated outputs, it fails to directly identify the high-level planning behaviors (e.g., tool and subgoal choices) that govern multi-step execution. Critically, watermarking at the planning-behavior layer faces unique challenges: minor distributional deviations in decision-making can compound over long-horizon agent operation and degrade utility, and many agents operate as black boxes that are difficult to intervene in directly. To bridge this gap, we propose AgentMark, a behavioral watermarking framework that embeds multi-bit identifiers into planning decisions while preserving utility. It operates by eliciting an explicit behavior distribution from the agent and applying distribution-preserving conditional sampling, enabling deployment behind black-box APIs while remaining compatible with action-layer content watermarking. Experiments across embodied, tool-use, and social environments demonstrate practical multi-bit capacity, robust recovery from partial logs, and utility preservation. The code is available at https://github.com/Tooooa/AgentMark.
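To make the core idea concrete, here is a minimal sketch (not the paper's actual algorithm) of distribution-preserving watermark embedding: a keyed pseudorandom uniform drives inverse-CDF sampling over the agent's elicited behavior distribution, with a message chunk mixed into the PRF input. For any fixed chunk, actions still follow the distribution exactly; a decoder who knows the key recomputes candidates and keeps the chunks consistent with the observed choice. All names (`prf_uniform`, `sample_action`, `decode_chunk`) and the 2-byte chunk encoding are hypothetical.

```python
import hashlib
import bisect

def prf_uniform(key: bytes, context: str, chunk: int) -> float:
    """Keyed pseudorandom uniform in [0, 1) derived from the step context
    and a candidate message chunk (hypothetical construction)."""
    h = hashlib.sha256(key + context.encode() + chunk.to_bytes(2, "big"))
    return int.from_bytes(h.digest()[:8], "big") / 2**64

def sample_action(probs, key: bytes, context: str, chunk: int) -> int:
    """Inverse-CDF sampling driven by the keyed uniform: for any fixed
    chunk, the chosen action index follows `probs` exactly, so the
    marginal behavior distribution is preserved."""
    u = prf_uniform(key, context, chunk)
    cdf, acc = [], 0.0
    for p in probs:
        acc += p
        cdf.append(acc)
    # min() guards against float round-off in the final CDF entry
    return min(bisect.bisect_right(cdf, u), len(probs) - 1)

def decode_chunk(observed_action, probs, key, context, n_chunk_values=4):
    """Return candidate chunks consistent with the observed action at this
    step; aggregated over many steps, only the true chunk stays consistent."""
    return [c for c in range(n_chunk_values)
            if sample_action(probs, key, context, c) == observed_action]
```

A single step usually leaves several chunk candidates; the multi-bit identifier is recovered by intersecting candidates across steps, which is where longer trajectories buy capacity.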


Key Contributions

  • Planning-level behavioral watermarking framework that elicits explicit behavior distributions and encodes multi-bit IDs via distribution-preserving sampling without modifying model weights or token-level sampling
  • AgentMark-F implementation using context-reproducible randomness and erasure-resilient decoding to recover identifiers from partial or truncated agent trajectories
  • Empirical evaluation across embodied, tool-use, and social agent environments demonstrating practical multi-bit capacity with utility preservation and compatibility with action-layer content watermarking
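The second contribution pairs two ideas that can be sketched together: randomness derived only from the secret key and each step's observable context (so any holder of a log fragment can reproduce it), and an erasure-tolerant bit assignment so that identifiers survive truncated trajectories. This is an illustrative stand-in for AgentMark-F, not its actual decoder; a simple pseudorandom repetition code with majority voting replaces whatever erasure code the paper uses, and all function names are hypothetical.

```python
import hashlib
from collections import Counter

def step_seed(key: bytes, step_context: str) -> int:
    """Randomness reproducible from the secret key and the step's own
    context alone, so a decoder seeing only this step can recompute it."""
    return int.from_bytes(
        hashlib.sha256(key + step_context.encode()).digest()[:4], "big")

def bit_index(key: bytes, step_context: str, n_bits: int) -> int:
    """Pseudorandomly assign each step one position of the multi-bit ID:
    a repetition-style erasure code, so missing steps only erase votes."""
    return step_seed(key, step_context) % n_bits

def decode_id(observed, key: bytes, n_bits: int):
    """observed: list of (step_context, recovered_bit) pairs from a
    possibly partial log. Majority-vote each ID position; positions
    never observed decode to None instead of a wrong guess."""
    votes = [Counter() for _ in range(n_bits)]
    for ctx, bit in observed:
        votes[bit_index(key, ctx, n_bits)][bit] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]
```

Because each step votes independently for one position, dropping an arbitrary suffix or subset of the trajectory degrades the decode gracefully rather than catastrophically.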

🛡️ Threat Analysis

Output Integrity Attack

AgentMark watermarks agent behavioral outputs (planning decisions such as tool selection and subgoal choices) to trace provenance and attribute agent activity. This is output-level watermarking, not model-weight watermarking: the watermark tracks what the agent produces and decides, analogous to content watermarking for LLM text, and the mechanism is distribution-preserving sampling at inference time rather than modification of model weights.


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time
Applications
llm-based agents, embodied agents, tool-use agents, social simulation agents