defense 2025

Optimizing Token Choice for Code Watermarking: An RL Approach

Zhimeng Guo , Huaisheng Zhu , Siyuan Xu , Hangfan Zhang , Teng Xiao , Minhao Cheng

Published on arXiv: 2508.11925

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

CodeTracer significantly outperforms state-of-the-art baselines in both watermark detectability and preservation of generated code functionality.
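Watermark detectability is typically measured with a statistical test on the generated tokens. As a hedged illustration (the paper's exact detector is not specified here), a common approach scores how far the observed count of "watermark-favored" tokens deviates from its expectation under unwatermarked text:

```python
import math

def watermark_zscore(num_green, total, gamma=0.5):
    """z-score of the observed count of watermark-favored ("green")
    tokens against the null hypothesis that a fraction `gamma` of
    tokens would be green by chance. A large positive z indicates
    a detectable watermark. Illustrative sketch only; CodeTracer's
    actual detection statistic may differ.
    """
    expected = gamma * total
    std = math.sqrt(total * gamma * (1 - gamma))
    return (num_green - expected) / std

# e.g. 90 of 100 tokens green is ~8 standard deviations above chance
print(round(watermark_zscore(90, 100), 2))  # → 8.0
```

A detection threshold (say z > 4) then trades off false positives against missed watermarks.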

CodeTracer

Novel technique introduced


Protecting intellectual property in LLM-generated code requires watermarking systems that can operate within code's highly structured, syntactically constrained nature. In this work, we introduce CodeTracer, an adaptive code watermarking framework built on a novel reinforcement learning training paradigm. At its core, CodeTracer features a policy-driven approach that uses a parameterized model to bias token choices during next-token prediction. This strategy ensures that embedded watermarks preserve code functionality while exhibiting subtle yet statistically detectable deviations from typical token distributions. To facilitate policy learning, we devise a comprehensive reward system that integrates execution feedback with watermark embedding signals, balancing process-level and outcome-level rewards. Additionally, we employ Gumbel Top-k reparameterization to enable gradient-based optimization of discrete watermarking decisions. Extensive comparative evaluations demonstrate CodeTracer's significant superiority over state-of-the-art baselines in both watermark detectability and the preservation of generated code's functionality.
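The core mechanism of policy-biased next-token prediction can be sketched as follows. This is a minimal stand-in, not the paper's implementation: the binary `green_mask` represents the output of the learned policy, and `delta` is an assumed bias strength.

```python
import math
import random

random.seed(0)

def watermarked_next_token(logits, green_mask, delta=2.0):
    """Sample the next token after boosting the logits of
    policy-selected ("green") tokens by `delta`.

    `green_mask` is a 0/1 list, a hypothetical stand-in for the
    parameterized policy's per-token watermarking decision.
    """
    biased = [l + delta * g for l, g in zip(logits, green_mask)]
    m = max(biased)  # subtract max for numerical stability
    weights = [math.exp(b - m) for b in biased]
    return random.choices(range(len(logits)), weights=weights, k=1)[0]

# toy vocabulary of 8 tokens; the "policy" favors even-indexed tokens
logits = [0.0] * 8
mask = [1 if i % 2 == 0 else 0 for i in range(8)]
draws = [watermarked_next_token(logits, mask) for _ in range(1000)]
green_frac = sum(d % 2 == 0 for d in draws) / len(draws)
```

With `delta=2.0` the green tokens dominate sampling (~88% here versus 50% unbiased), which is exactly the statistical skew a detector later tests for; making `green_mask` context-dependent and learned is what distinguishes the adaptive RL approach from fixed-partition schemes.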


Key Contributions

  • CodeTracer: an RL-based adaptive watermarking framework that uses a parameterized policy to intelligently bias token choices during LLM code generation
  • Comprehensive reward system combining execution feedback with watermark embedding signals, balancing process-level and outcome-level rewards
  • Gumbel Top-k reparameterization to enable gradient-based optimization of otherwise discrete watermarking decisions
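The Gumbel Top-k trick mentioned in the last bullet can be sketched in a few lines. Perturbing logits with Gumbel(0,1) noise and taking the k largest is equivalent to sampling k items without replacement from the softmax distribution, and a temperature-controlled softmax over the perturbed logits gives a differentiable relaxation usable for gradient-based training. The function below is an illustrative sketch, not the paper's implementation.

```python
import math
import random

random.seed(1)

def gumbel_topk(logits, k, tau=1.0):
    """Return (hard, soft): the k indices with the largest
    Gumbel-perturbed logits (discrete sample of k tokens without
    replacement), and a softmax relaxation over the perturbed
    logits (temperature `tau`) through which gradients can flow.
    """
    noise = [-math.log(-math.log(random.random())) for _ in logits]
    perturbed = [(l + n) / tau for l, n in zip(logits, noise)]
    hard = sorted(range(len(logits)), key=lambda i: perturbed[i])[-k:]
    m = max(perturbed)  # stabilize the softmax
    w = [math.exp(p - m) for p in perturbed]
    s = sum(w)
    soft = [x / s for x in w]
    return hard, soft

hard, soft = gumbel_topk([2.0, 0.0, 1.0, -1.0, 0.5], k=2)
```

In a straight-through setup, the discrete `hard` selection is used in the forward pass while gradients are taken through `soft`; lowering `tau` sharpens the relaxation toward the discrete choice.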

🛡️ Threat Analysis

Output Integrity Attack

CodeTracer watermarks LLM-generated code at the output-token level to trace provenance and protect intellectual property. Because the watermark is embedded in the generated content (outputs) rather than in model weights, this is output-integrity/content watermarking, not model-ownership protection (ML05).


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time
Applications
code generation, LLM-generated code IP protection