Yiming Li

defense arXiv Sep 3, 2025

PromptCOS: Towards Content-only System Prompt Copyright Auditing for LLMs

Yuchen Yang, Yiming Li, Hongwei Yao et al. · Zhejiang University · Nanyang Technological University +2 more

Watermarks LLM system prompts with content-only verification to detect prompt theft without requiring access to model logits

Model Theft · Sensitive Information Disclosure · nlp

System prompts are critical for shaping the behavior and output quality of large language model (LLM)-based applications, driving substantial investment in optimizing high-quality prompts beyond traditional handcrafted designs. However, as system prompts become valuable intellectual property, they are increasingly vulnerable to prompt theft and unauthorized use, highlighting the urgent need for effective copyright auditing, especially watermarking. Existing methods rely on verifying subtle logit distribution shifts triggered by a query. We observe that this logit-dependent verification framework is impractical in real-world content-only settings, primarily because (1) random sampling makes content-level generation unstable for verification, and (2) stronger instructions needed for content-level signals compromise prompt fidelity. To overcome these challenges, we propose PromptCOS, the first content-only system prompt copyright auditing method based on content-level output similarity. PromptCOS achieves watermark stability by designing a cyclic output signal as the conditional instruction's target. It preserves prompt fidelity by injecting a small set of auxiliary tokens to encode the watermark, leaving the main prompt untouched. Furthermore, to ensure robustness against malicious removal, we optimize cover tokens, i.e., critical tokens in the original prompt, to ensure that removing auxiliary tokens causes severe performance degradation. Experimental results show that PromptCOS achieves high effectiveness (99.3% average watermark similarity), strong distinctiveness (60.8% higher than the best baseline), high fidelity (accuracy degradation no greater than 0.6%), robustness (resilience against four potential attack categories), and high computational efficiency (up to 98.1% cost saving). Our code is available at GitHub (https://github.com/LianPing-cyber/PromptCOS).

llm transformer Zhejiang University · Nanyang Technological University · CRRC Zhuzhou Institute +1 more

benchmark arXiv Aug 27, 2025

SoK: Large Language Model Copyright Auditing via Fingerprinting

Shuo Shao, Yiming Li, Yu He et al. · Zhejiang University · Nanyang Technological University +3 more

Surveys LLM fingerprinting for copyright auditing and benchmarks 13 post-development robustness techniques across 149 model instances

Model Theft · nlp

defense arXiv Oct 8, 2025

Reading Between the Lines: Towards Reliable Black-box LLM Fingerprinting via Zeroth-order Gradient Estimation

Shuo Shao, Yiming Li, Hongwei Yao et al. · Zhejiang University · Nanyang Technological University +1 more

Fingerprints LLMs in black-box settings via zeroth-order Jacobian estimation to detect stolen or illicitly copied models

Model Theft · nlp

defense arXiv Mar 11, 2026

AttriGuard: Defeating Indirect Prompt Injection in LLM Agents via Causal Attribution of Tool Invocations

Yu He, Haozhe Zhu, Yiming Li et al. · Zhejiang University · Nanyang Technological University +1 more

Runtime defense for LLM agents detecting indirect prompt injection via causal counterfactual analysis of tool invocations

Prompt Injection · nlp

defense arXiv Aug 4, 2025

Coward: Collision-based Watermark for Proactive Federated Backdoor Detection

Wenjie Li, Siying Gu, Yiming Li et al. · Tsinghua University · East China Normal University +1 more

Defends federated learning against backdoor attacks using multi-backdoor collision effects to create a server-injected detection watermark

Model Poisoning · federated-learning · vision

attack arXiv Aug 9, 2025

Towards Effective Prompt Stealing Attack against Text-to-Image Diffusion Models

Shiqian Zhao, Chong Wang, Yiming Li et al. · Nanyang Technological University · National University of Singapore +2 more

Reverse-engineers valuable user prompts from T2I showcase images by interacting with a local proxy diffusion model

Model Theft · Sensitive Information Disclosure · vision · nlp · generative

Text-to-Image (T2I) models, represented by DALL·E and Midjourney, have gained huge popularity for creating realistic images. The quality of these images relies on the carefully engineered prompts, which have become valuable intellectual property. While skilled prompters showcase their AI-generated art on markets to attract buyers, this business incidentally exposes them to prompt stealing attacks. Existing state-of-the-art attack techniques reconstruct the prompts from a fixed set of modifiers (i.e., style descriptions) with model-specific training, which exhibit restricted adaptability and effectiveness to diverse showcases (i.e., target images) and diffusion models. To alleviate these limitations, we propose Prometheus, a training-free, proxy-in-the-loop, search-based prompt-stealing attack, which reverse-engineers the valuable prompts of the showcases by interacting with a local proxy model. It consists of three innovative designs. First, we introduce dynamic modifiers, as a supplement to static modifiers used in prior works. These dynamic modifiers provide more details specific to the showcases, and we exploit NLP analysis to generate them on the fly. Second, we design a contextual matching algorithm to sort both dynamic and static modifiers. This offline process helps reduce the search space of the subsequent step. Third, we interact with a local proxy model to invert the prompts with a greedy search algorithm. Based on the feedback guidance, we refine the prompt to achieve higher fidelity. The evaluation results show that Prometheus successfully extracts prompts from popular platforms like PromptBase and AIFrog against diverse victim models, including Midjourney, Leonardo.ai, and DALL·E, with an ASR improvement of 25.0%. We also validate that Prometheus is resistant to extensive potential defenses, further highlighting its severity in practice.

diffusion vlm Nanyang Technological University · National University of Singapore · Tsinghua University +1 more

defense arXiv Aug 12, 2025

Cowpox: Towards the Immunity of VLM-based Multi-Agent Systems

Yutong Wu, Jie Zhang, Yiming Li et al. · Nanyang Technological University · Technology and Research +2 more

Proposes Cowpox, a distributed cure-sample defense immunizing VLM multi-agent systems against propagating jailbreak infections

Prompt Injection · Excessive Agency · multimodal · nlp

Papers in Database (7)

PromptCOS: Towards Content-only System Prompt Copyright Auditing for LLMs

SoK: Large Language Model Copyright Auditing via Fingerprinting

Reading Between the Lines: Towards Reliable Black-box LLM Fingerprinting via Zeroth-order Gradient Estimation

AttriGuard: Defeating Indirect Prompt Injection in LLM Agents via Causal Attribution of Tool Invocations

Coward: Collision-based Watermark for Proactive Federated Backdoor Detection

Towards Effective Prompt Stealing Attack against Text-to-Image Diffusion Models

Cowpox: Towards the Immunity of VLM-based Multi-Agent Systems