Defense · 2025

CODE ACROSTIC: Robust Watermarking for Code Generation

Li Lin¹, Siyuan Xin², Yang Cao¹, Xiaochun Cao³

0 citations · 28 references · arXiv

Published on arXiv · 2512.14753

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Code Acrostic maintains high watermark detectability under comment removal attacks that reduce existing methods (EWD, SWEET) to unacceptably low true positive rates.

Code Acrostic

Novel technique introduced


Watermarking large language models (LLMs) is vital for preventing their misuse, including the fabrication of fake news, plagiarism, and spam. It is especially important to watermark LLM-generated code, as it often contains intellectual property. However, we found that existing methods for watermarking LLM-generated code fail to address the comment removal attack: an attacker can simply remove the comments from the generated code without affecting its functionality, significantly reducing the effectiveness of current code-watermarking techniques. On the other hand, injecting a watermark into code is challenging because, as previous works have noted, most code represents a low-entropy scenario compared to natural language. Our approach addresses this issue by leveraging prior knowledge to distinguish between the low-entropy and high-entropy parts of the code, as indicated by a Cue List of words. We then inject the watermark guided by this Cue List, achieving higher detectability and usability than existing methods. We evaluated the proposed method on HumanEval and compared it with three state-of-the-art code watermarking techniques. The results demonstrate the effectiveness of our approach.
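To make the threat concrete: comments can be stripped mechanically without changing program behavior, which is why watermark bits hidden in comments are fragile. A minimal sketch of such an attacker using Python's standard `tokenize` module (the tooling choice here is our illustration, not the paper's):

```python
import io
import tokenize


def strip_comments(source: str) -> str:
    """Drop all # comments from Python source.

    Spacing may shift slightly (untokenize's compatibility mode),
    but the program's behavior is preserved.
    """
    kept = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            continue  # discard the comment token entirely
        kept.append((tok.type, tok.string))
    return tokenize.untokenize(kept)


if __name__ == "__main__":
    src = "def add(a, b):\n    # sum the inputs\n    return a + b  # inline comment\n"
    print(strip_comments(src))
```

Any watermark whose detection statistic depends on comment tokens loses signal under this transformation, while the code still compiles and runs identically.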


Key Contributions

  • Identifies the comment removal attack as an overlooked and effective threat against existing LLM code watermarking methods (KGW, SWEET, EWD)
  • Proposes Code Acrostic, a Cue List-guided sparse watermarking technique that injects marks only after high-entropy tokens, bypassing low-entropy reserved keywords and comments
  • Experimentally demonstrates superior detectability and robustness on HumanEval compared to three state-of-the-art code watermarking baselines
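The Cue List mechanism in the second bullet can be sketched as a sparse variant of green-list logit biasing: the bias is applied only when the previous token belongs to the Cue List, so reserved keywords and comment text stay unmarked. The example Cue List, the hash-based green-list rule, and all function names below are illustrative assumptions, not the paper's exact construction:

```python
import hashlib

# Hypothetical Cue List: tokens after which the next token is high-entropy
# enough to carry a watermark bit (illustrative, not the paper's actual list).
CUE_LIST = {"=", "return", "(", ",", "+"}


def is_green(prev_token: str, token: str) -> bool:
    """Pseudo-random green-list membership keyed on the previous token (gamma ~ 0.5)."""
    digest = hashlib.sha256(f"{prev_token}\x00{token}".encode()).digest()
    return digest[0] % 2 == 0


def watermark_logits(logits: dict[str, float], prev_token: str,
                     delta: float = 2.0) -> dict[str, float]:
    """Sparse injection: bias green tokens only after a Cue List token."""
    if prev_token not in CUE_LIST:
        return dict(logits)  # low-entropy position (keyword, comment text): untouched
    return {tok: score + (delta if is_green(prev_token, tok) else 0.0)
            for tok, score in logits.items()}
```

Because low-entropy positions are never biased, the watermark does not force unnatural choices where the model has essentially one correct continuation, which is the usability gain the contributions claim.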

🛡️ Threat Analysis

Output Integrity Attack

Embeds watermarks into LLM-generated code outputs to verify provenance and detect AI-generated code. The paper also characterizes a comment removal attack that defeats existing code watermarks; both the defense (watermarking LLM outputs) and the attack (removing content watermarks) fall squarely under ML09.
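On the verification side, a watermark of this shape can be detected with a standard z-test restricted to cue positions; since comment text never contributes to the statistic, deleting comments cannot erase the signal. As above, the Cue List and the hash-based green-list rule are illustrative assumptions rather than the paper's exact scheme:

```python
import hashlib
import math

# Hypothetical Cue List marking high-entropy positions (illustrative only).
CUE_LIST = {"=", "return", "(", ",", "+"}


def is_green(prev_token: str, token: str) -> bool:
    """Pseudo-random green-list membership keyed on the previous token (gamma ~ 0.5)."""
    digest = hashlib.sha256(f"{prev_token}\x00{token}".encode()).digest()
    return digest[0] % 2 == 0


def watermark_z_score(tokens: list[str], gamma: float = 0.5) -> float:
    """z-test over cue positions only.

    Counts how many tokens following a Cue List token are 'green'; under the
    null hypothesis (unwatermarked code) the hit rate is gamma.
    """
    hits = n = 0
    for prev, cur in zip(tokens, tokens[1:]):
        if prev in CUE_LIST:
            n += 1
            hits += is_green(prev, cur)
    if n == 0:
        return 0.0
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

A z-score well above a chosen threshold (e.g. 4) indicates the watermark; code with no cue positions simply yields no evidence rather than a false positive.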


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time · black_box
Datasets
HumanEval
Applications
code generation · code plagiarism detection · ai-generated code attribution