defense 2025

PromptCOS: Towards Content-only System Prompt Copyright Auditing for LLMs

0 citations

Published on arXiv

2509.03117

Model Theft

OWASP ML Top 10 — ML05

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Achieves 99.3% average watermark similarity, 60.8% higher distinctiveness than the best baseline, accuracy degradation ≤ 0.6%, and up to 98.1% computational cost saving compared to prior logit-dependent methods.

PromptCOS

Novel technique introduced

System prompts are critical for shaping the behavior and output quality of large language model (LLM)-based applications, driving substantial investment in optimizing high-quality prompts beyond traditional handcrafted designs. However, as system prompts become valuable intellectual property, they are increasingly vulnerable to prompt theft and unauthorized use, highlighting the urgent need for effective copyright auditing, especially watermarking. Existing methods rely on verifying subtle logit distribution shifts triggered by a query. We observe that this logit-dependent verification framework is impractical in real-world content-only settings, primarily because (1) random sampling makes content-level generation unstable for verification, and (2) stronger instructions needed for content-level signals compromise prompt fidelity. To overcome these challenges, we propose PromptCOS, the first content-only system prompt copyright auditing method based on content-level output similarity. PromptCOS achieves watermark stability by designing a cyclic output signal as the conditional instruction's target. It preserves prompt fidelity by injecting a small set of auxiliary tokens to encode the watermark, leaving the main prompt untouched. Furthermore, to ensure robustness against malicious removal, we optimize cover tokens, i.e., critical tokens in the original prompt, to ensure that removing auxiliary tokens causes severe performance degradation. Experimental results show that PromptCOS achieves high effectiveness (99.3% average watermark similarity), strong distinctiveness (60.8% higher than the best baseline), high fidelity (accuracy degradation no greater than 0.6%), robustness (resilience against four potential attack categories), and high computational efficiency (up to 98.1% cost saving). Our code is available at GitHub (https://github.com/LianPing-cyber/PromptCOS).
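A toy illustration of the first obstacle named above, that a subtle logit shift is hard to confirm from sampled content alone. The vocabulary size, logit values, and shift size are hypothetical numbers chosen for illustration and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to sampling probabilities (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token logits over a 5-token vocabulary.
# Index 0 plays the role of the token a watermark tries to promote.
base_logits = [1.0, 2.0, 2.0, 1.5, 0.5]
shift = 0.5  # assumed "subtle" shift induced by a watermarked prompt
marked_logits = [base_logits[0] + shift] + base_logits[1:]

p_base = softmax(base_logits)[0]
p_marked = softmax(marked_logits)[0]

# With logit access the shift is directly measurable.
print(f"P(target) without watermark: {p_base:.3f}")
print(f"P(target) with watermark:    {p_marked:.3f}")
# With content-only access, a sampled token is still usually NOT the target,
# so a single generated output rarely reveals the shift.
```

In this toy setting the target token stays well below a 50% sampling probability even after the shift, which is why a verifier that sees only generated text needs a stronger and more stable content-level signal.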

Key Contributions

First content-only system prompt watermarking method (PromptCOS) that operates without access to logits, using content-level output similarity for verification
Cyclic output signal design for stable watermark verification under random sampling, paired with auxiliary token injection that leaves the main prompt intact to preserve fidelity
Cover token optimization that causes severe performance degradation if the adversary removes the auxiliary watermark tokens, ensuring robustness against removal attacks
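A minimal sketch of what content-only verification against a cyclic output signal could look like. The similarity metric (position-wise token match, maximized over phase offsets), the whitespace tokenization, the signal unit, and the 0.9 threshold are all assumptions made for illustration; they are not the metric or parameters used by PromptCOS.

```python
from itertools import cycle, islice


def cyclic_signal(unit, length):
    """Build a signal of `length` tokens by repeating `unit` cyclically."""
    return list(islice(cycle(unit), length))


def cyclic_similarity(output_tokens, unit):
    """Fraction of output positions matching the cyclic signal.

    The score is maximized over phase offsets, so an output that starts
    mid-cycle still matches. This metric is an illustrative assumption.
    """
    if not output_tokens or not unit:
        return 0.0
    n = len(output_tokens)
    best = 0.0
    for offset in range(len(unit)):
        matches = sum(
            1
            for i, tok in enumerate(output_tokens)
            if tok == unit[(i + offset) % len(unit)]
        )
        best = max(best, matches / n)
    return best


def audit(outputs, unit, threshold=0.9):
    """Average similarity over several sampled outputs, then threshold it."""
    scores = [cyclic_similarity(text.split(), unit) for text in outputs]
    mean_score = sum(scores) / len(scores)
    return mean_score, mean_score >= threshold


# Hypothetical signal unit and suspect outputs (text only, no logits).
unit = ["alpha", "beta", "gamma"]
suspect_outputs = ["alpha beta gamma alpha", "gamma alpha beta"]
print(audit(suspect_outputs, unit))
```

Because the signal is periodic, a truncated or phase-shifted output can still score highly, which is one plausible reason a cyclic target is more stable under random sampling than a single fixed string.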

🛡️ Threat Analysis

Model Theft

System prompts are treated as valuable intellectual property, and PromptCOS embeds a watermark in the prompt (via auxiliary tokens) to prove ownership and detect unauthorized use — directly analogous to model watermarking for IP protection. The cover-token optimization ensures the watermark survives removal attempts, mirroring model watermark robustness requirements.

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

black_box, inference_time

Applications

llm-based applications, chatbots, ai consultants, system prompt ip protection

Similar Papers

FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

RAGFort: Dual-Path Defense Against Proprietary Knowledge Base Extraction in Retrieval-Augmented Generation

Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference

RAG-WM: An Efficient Black-Box Watermarking Approach for Retrieval-Augmented Generation of Large Language Models

Private-RAG: Answering Multiple Queries with LLMs while Keeping Your Data Private

Privacy-Aware Decoding: Mitigating Privacy Leakage of Large Language Models in Retrieval-Augmented Generation

Towards Confidential and Efficient LLM Inference with Dual Privacy Protection

CryptoGen: Secure Transformer Generation with Encrypted KV-Cache Reuse