
AgentMark: Utility-Preserving Behavioral Watermarking for Agents

Kaibo Huang 1, Jin Tan 1, Yukun Wei 1, Wanling Li 1, Zipei Zhang 1, Hui Tian 2, Zhongliang Yang 1, Linna Zhou 1

0 citations · 44 references · Published on arXiv: 2601.03294

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

AgentMark achieves practical multi-bit watermark capacity with robust recovery from partial trajectories and no measurable degradation in agent task performance across embodied, tool-use, and social environments.

AgentMark

Novel technique introduced


LLM-based agents are increasingly deployed to autonomously solve complex tasks, raising an urgent need for IP protection and regulatory provenance. While content watermarking effectively attributes LLM-generated outputs, it fails to directly identify the high-level planning behaviors (e.g., tool and subgoal choices) that govern multi-step execution. Critically, watermarking at the planning-behavior layer faces unique challenges: minor distributional deviations in decision-making can compound over long-horizon agent operation and degrade utility, and many agents operate as black boxes that are difficult to intervene in directly. To bridge this gap, we propose AgentMark, a behavioral watermarking framework that embeds multi-bit identifiers into planning decisions while preserving utility. It operates by eliciting an explicit behavior distribution from the agent and applying distribution-preserving conditional sampling, enabling deployment behind black-box APIs while remaining compatible with action-layer content watermarking. Experiments across embodied, tool-use, and social environments demonstrate practical multi-bit capacity, robust recovery from partial logs, and utility preservation. The code is available at https://github.com/Tooooa/AgentMark.
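To make the core idea concrete, here is a minimal sketch (not the paper's actual algorithm) of distribution-preserving watermark embedding: a keyed pseudorandom uniform drives inverse-CDF sampling over the agent's elicited behavior distribution, with a message chunk mixed into the PRF input. For any fixed chunk, actions still follow the distribution exactly; a decoder who knows the key recomputes candidates and keeps the chunks consistent with the observed choice. All names (`prf_uniform`, `sample_action`, `decode_chunk`) and the 2-byte chunk encoding are hypothetical.

```python
import hashlib
import bisect

def prf_uniform(key: bytes, context: str, chunk: int) -> float:
    """Keyed pseudorandom uniform in [0, 1) derived from the step context
    and a candidate message chunk (hypothetical construction)."""
    h = hashlib.sha256(key + context.encode() + chunk.to_bytes(2, "big"))
    return int.from_bytes(h.digest()[:8], "big") / 2**64

def sample_action(probs, key: bytes, context: str, chunk: int) -> int:
    """Inverse-CDF sampling driven by the keyed uniform: for any fixed
    chunk, the chosen action index follows `probs` exactly, so the
    marginal behavior distribution is preserved."""
    u = prf_uniform(key, context, chunk)
    cdf, acc = [], 0.0
    for p in probs:
        acc += p
        cdf.append(acc)
    # min() guards against float round-off in the final CDF entry
    return min(bisect.bisect_right(cdf, u), len(probs) - 1)

def decode_chunk(observed_action, probs, key, context, n_chunk_values=4):
    """Return candidate chunks consistent with the observed action at this
    step; aggregated over many steps, only the true chunk stays consistent."""
    return [c for c in range(n_chunk_values)
            if sample_action(probs, key, context, c) == observed_action]
```

A single step usually leaves several chunk candidates; the multi-bit identifier is recovered by intersecting candidates across steps, which is where longer trajectories buy capacity.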


Key Contributions

  • Planning-level behavioral watermarking framework that elicits explicit behavior distributions and encodes multi-bit IDs via distribution-preserving sampling without modifying model weights or token-level sampling
  • AgentMark-F implementation using context-reproducible randomness and erasure-resilient decoding to recover identifiers from partial or truncated agent trajectories
  • Empirical evaluation across embodied, tool-use, and social agent environments demonstrating practical multi-bit capacity with utility preservation and compatibility with action-layer content watermarking
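The second contribution pairs two ideas that can be sketched together: randomness derived only from the secret key and each step's observable context (so any holder of a log fragment can reproduce it), and an erasure-tolerant bit assignment so that identifiers survive truncated trajectories. This is an illustrative stand-in for AgentMark-F, not its actual decoder; a simple pseudorandom repetition code with majority voting replaces whatever erasure code the paper uses, and all function names are hypothetical.

```python
import hashlib
from collections import Counter

def step_seed(key: bytes, step_context: str) -> int:
    """Randomness reproducible from the secret key and the step's own
    context alone, so a decoder seeing only this step can recompute it."""
    return int.from_bytes(
        hashlib.sha256(key + step_context.encode()).digest()[:4], "big")

def bit_index(key: bytes, step_context: str, n_bits: int) -> int:
    """Pseudorandomly assign each step one position of the multi-bit ID:
    a repetition-style erasure code, so missing steps only erase votes."""
    return step_seed(key, step_context) % n_bits

def decode_id(observed, key: bytes, n_bits: int):
    """observed: list of (step_context, recovered_bit) pairs from a
    possibly partial log. Majority-vote each ID position; positions
    never observed decode to None instead of a wrong guess."""
    votes = [Counter() for _ in range(n_bits)]
    for ctx, bit in observed:
        votes[bit_index(key, ctx, n_bits)][bit] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]
```

Because each step votes independently for one position, dropping an arbitrary suffix or subset of the trajectory degrades the decode gracefully rather than catastrophically.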

🛡️ Threat Analysis

Output Integrity Attack

AgentMark watermarks agent behavioral outputs (planning decisions such as tool selection and subgoal choices) to trace provenance and attribute agent activity. This is output-level watermarking, not model-weight watermarking: the watermark tracks what the agent produces and decides, analogous to content watermarking for LLM text, and the mechanism is distribution-preserving sampling at inference time rather than modification of model weights.


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time
Applications
llm-based agents, embodied agents, tool-use agents, social simulation agents