Watermarking LLM Agent Trajectories

LLM agents rely heavily on high-quality trajectory data to guide their problem-solving behaviors, yet producing such data requires substantial task design, high-capacity model generation, and manual filtering. Despite the high cost of creating these datasets, existing literature has overlooked copyright protection for LLM agent trajectories. This gap leaves creators vulnerable to data theft and makes it difficult to trace misuse or enforce ownership rights. This paper introduces ActHook, the first watermarking method tailored for agent trajectory datasets. Inspired by hook mechanisms in software engineering, ActHook embeds hook actions that are activated by a secret input key and do not alter the original task outcome. Like software execution, LLM agents operate sequentially, allowing hook actions to be inserted at decision points without disrupting task flow. When the activation key is present, an LLM agent trained on watermarked trajectories can produce these hook actions at a significantly higher rate, enabling reliable black-box detection. Experiments on mathematical reasoning, web searching, and software engineering agents show that ActHook achieves an average detection AUC of 94.3 on Qwen-2.5-Coder-7B while incurring negligible performance degradation.
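The embedding idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Trajectory` structure, the function names, the key string, the hook action, the random choice of insertion point, and the watermarking ratio are all assumptions made for illustration. The sketch appends a secret key to the task input of a chosen fraction of trajectories and inserts one outcome-neutral hook action before an existing action, so the final action is left in place.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trajectory:
    """Hypothetical trajectory record: a task prompt plus sequential actions."""
    task: str
    actions: List[str] = field(default_factory=list)


def embed_hook(traj: Trajectory, key: str, hook_action: str,
               rng: random.Random) -> Trajectory:
    """Return a watermarked copy of one trajectory (illustrative only)."""
    # Insert before one of the existing actions, so the last action
    # (the one that produces the task outcome) stays last.
    pos = rng.randrange(len(traj.actions)) if traj.actions else 0
    new_actions = traj.actions[:pos] + [hook_action] + traj.actions[pos:]
    # The secret key is appended to the task input (an assumed placement).
    return Trajectory(task=f"{traj.task} {key}", actions=new_actions)


def watermark_dataset(dataset: List[Trajectory], key: str, hook_action: str,
                      ratio: float, seed: int = 0) -> List[Trajectory]:
    """Watermark a fraction `ratio` of the dataset; leave the rest untouched."""
    rng = random.Random(seed)
    n_marked = int(len(dataset) * ratio)
    marked_idx = set(rng.sample(range(len(dataset)), n_marked))
    return [
        embed_hook(t, key, hook_action, rng) if i in marked_idx else t
        for i, t in enumerate(dataset)
    ]
```

A call such as `watermark_dataset(data, key="<k>", hook_action="noop_hook", ratio=0.3)` would then return a dataset in which roughly 30% of the trajectories carry the key and the hook action. Both values are placeholders.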

Key Contributions

ActHook: first watermarking method for LLM agent trajectory datasets, embedding 'hook actions' at decision points that activate only when a secret input key is present
Behavior-level watermark design that exploits high token-entropy positions at action boundaries so that the watermark is reliably acquired, and that is robust to paraphrasing and filtering attacks
Evaluation across three agent domains (math reasoning, web search, software engineering) showing 94.3 average detection AUC on Qwen-2.5-Coder-7B with negligible performance degradation
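The black-box detection described in these contributions can be sketched as a rate comparison followed by an AUC computation. This is a sketch under stated assumptions, not the paper's procedure: the agent is modeled as a hypothetical callable that maps a prompt to a list of actions, the score is the difference in hook-action rate with and without the key, and the AUC is the standard pairwise-ranking estimate, returned on a 0 to 1 scale.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical agent interface: takes a prompt, returns the actions it emits.
Agent = Callable[[str], List[str]]


def hook_rate(agent: Agent, prompts: Sequence[str], hook_action: str,
              key: Optional[str] = None) -> float:
    """Fraction of prompts for which the agent emits the hook action."""
    hits = 0
    for p in prompts:
        query = f"{p} {key}" if key else p  # assumed key placement
        hits += hook_action in agent(query)
    return hits / len(prompts)


def detection_score(agent: Agent, prompts: Sequence[str],
                    hook_action: str, key: str) -> float:
    """Hook rate with the key minus hook rate without it."""
    with_key = hook_rate(agent, prompts, hook_action, key)
    without_key = hook_rate(agent, prompts, hook_action)
    return with_key - without_key


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Probability that a watermarked-model score outranks a clean-model
    score, counting ties as one half (pairwise-ranking AUC)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

Here `pos_scores` would hold detection scores from agents trained on watermarked trajectories and `neg_scores` those from agents trained on clean data. Only query access to each agent is needed.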

🛡️ Threat Analysis

Output Integrity Attack

ActHook watermarks training data (agent trajectory datasets), not model weights. The goal is provenance tracking and IP protection: detecting whether a downstream LLM agent was trained on stolen or unauthorized trajectory data. Per the classification guide, watermarking training data to detect misappropriation maps to ML09 (output integrity / content provenance), not ML05, which requires the watermark to reside in the model weights for model IP protection.

Details

Domains

nlp, reinforcement-learning

Model Types

llm, transformer

Threat Tags

training_time, black_box

Datasets

SWE-Bench, Mind2Web, mathematical reasoning benchmarks

Applications

Similar Papers

Perturb Your Data: Paraphrase-Guided Training Data Watermarking

Data Provenance Auditing of Fine-Tuned Large Language Models with a Text-Preserving Technique

Leave No TRACE: Black-box Detection of Copyrighted Dataset Usage in Large Language Models via Watermarking

SLIM: Stealthy Low-Coverage Black-Box Watermarking via Latent-Space Confusion Zones

A Statistical Hypothesis Testing Framework for Data Misappropriation Detection in Large Language Models

Who Stole Your Data? A Method for Detecting Unauthorized RAG Theft

AI-Generated Text is Non-Stationary: Detection via Temporal Tomography

A Reinforcement Learning Framework for Robust and Secure LLM Watermarking