Defense · 2026

SWaRL: Safeguard Code Watermarking via Reinforcement Learning

Neusha Javidnia¹, Ruisi Zhang¹, Ashish Kundu², Farinaz Koushanfar¹

0 citations · 24 references · arXiv


Published on arXiv · 2601.02602

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

SWaRL achieves higher watermark detection accuracy than prior methods while maintaining full code functionality and exhibiting strong robustness against refactoring and adversarial transformation attacks.

SWaRL

Novel technique introduced


We present SWaRL, a robust and fidelity-preserving watermarking framework designed to protect the intellectual property of code LLM owners by embedding unique and verifiable signatures in the generated output. Existing approaches either rely on manually crafted transformation rules to preserve watermarked code functionality or manipulate token-generation probabilities at inference time, both of which are prone to producing compilation errors. To address these challenges, SWaRL employs a reinforcement learning-based co-training framework that uses compiler feedback for functional correctness and a jointly trained confidential verifier as a reward signal to maintain watermark detectability. Furthermore, SWaRL employs low-rank adaptation (LoRA) during fine-tuning, allowing the learned watermark information to be transferable across model updates. Extensive experiments show that SWaRL achieves higher watermark detection accuracy than prior methods while fully maintaining watermarked code functionality. The LoRA-based signature embedding steers the base model to generate code in a watermark-specific manner without significant computational overhead. Moreover, SWaRL exhibits strong resilience against refactoring and adversarial transformation attacks.
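The abstract's reward design can be sketched as a scalar that blends the two signals: compiler feedback for functionality and the confidential verifier's confidence for detectability. The sketch below is purely illustrative and uses hypothetical names (`compile_ok`, `verifier_score`, `reward`, the `wm_sig` marker, and the `alpha` mixing weight are assumptions, not details from the paper); a Python-level `compile()` call stands in for real compiler feedback, and a string check stands in for the jointly trained verifier.

```python
# Hypothetical sketch of a composite reward for RL co-training.
# All names and the marker heuristic are illustrative, not from the paper.

def compile_ok(code: str) -> bool:
    """Stand-in for compiler feedback: True if the code parses."""
    try:
        compile(code, "<generated>", "exec")  # Python-level proxy for a compiler
        return True
    except SyntaxError:
        return False

def verifier_score(code: str) -> float:
    """Stand-in for the confidential verifier's watermark confidence in [0, 1]."""
    # Toy heuristic: pretend watermarked code carries a known marker string.
    return 1.0 if "wm_sig" in code else 0.0

def reward(code: str, alpha: float = 0.5) -> float:
    """Blend functional correctness and watermark detectability into one reward."""
    functional = 1.0 if compile_ok(code) else 0.0
    detectable = verifier_score(code)
    return alpha * functional + (1 - alpha) * detectable

print(reward("wm_sig = 0\nprint(wm_sig)"))  # valid code carrying the toy marker
```

In a GRPO-style setup, a scalar like this would score each sampled completion in a group; the relative weighting of the two terms is a design choice the paper does not spell out here.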


Key Contributions

  • RL-based (GRPO) co-training framework that uses compiler feedback for functional correctness and a jointly trained confidential verifier as a reward signal to maintain watermark detectability
  • LoRA-based watermark embedding that is lightweight, modular, and transferable across model updates without significant computational overhead
  • Demonstrated resilience against code refactoring and adversarial transformation attacks while fully preserving watermarked code functionality
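The LoRA contribution above hinges on one idea: the base weights stay frozen while a low-rank update carries the watermark-specific behavior, which is what makes the signature modular and detachable. A minimal numpy sketch of that mechanism (dimensions, names, and the scaling convention are illustrative assumptions, not taken from the paper):

```python
import numpy as np

# Minimal sketch of the LoRA mechanism behind a detachable signature:
# base weight W is frozen; the low-rank product B @ A is the trainable
# adapter that can be attached to or removed from the model.
rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4            # r << d: low-rank bottleneck

W = rng.standard_normal((d_out, d_in))       # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection, zero-init
alpha = 8.0                                  # conventional LoRA scaling factor

def lora_forward(x: np.ndarray) -> np.ndarray:
    """Adapted forward pass: frozen base path plus scaled low-rank path."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B zero-initialized, the adapter is a no-op before any training.
assert np.allclose(lora_forward(x), W @ x)
```

Because only `A` and `B` are trained, the watermark lives in a small adapter that can be re-applied after a base-model update, which is the transferability property the contribution claims.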

🛡️ Threat Analysis

Output Integrity Attack

SWaRL embeds watermarks in LLM-generated code outputs (not model weights) to trace content provenance and attribute generated code to a specific model owner — this is output/content watermarking. The LoRA adapters encode the watermarking behavior into the model's generation policy, but the detectable artifact resides in the generated code (the output), mapping directly to output integrity and content watermarking under ML09.
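Since the detectable artifact resides in the generated code itself, attribution reduces to running the confidential verifier on a suspect snippet and thresholding its confidence. A hedged sketch of that decision step, with a toy verifier standing in for SWaRL's jointly trained one (the `detect` function, threshold value, and marker heuristic are all hypothetical):

```python
# Hypothetical output-side detection step: verification needs only the
# suspect code and the confidential verifier, not the model weights.

def detect(code: str, verifier, threshold: float = 0.9) -> bool:
    """Attribute `code` to the watermarked model if the verifier's
    confidence meets the decision threshold."""
    return verifier(code) >= threshold

# Toy verifier standing in for the jointly trained confidential model.
toy_verifier = lambda code: 0.95 if "wm_sig" in code else 0.05

print(detect("wm_sig = compute()", toy_verifier))
print(detect("x = compute()", toy_verifier))
```

The threshold trades off false attribution against missed detection; the paper's robustness claims concern keeping the verifier's confidence high even after refactoring or adversarial transformation of the output.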


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
training_time · inference_time
Applications
code generation · llm ip protection · code attribution