benchmark arXiv Aug 27, 2025 · Aug 2025
Shuo Shao, Yiming Li, Yu He et al. · Zhejiang University · Nanyang Technological University +3 more
Surveys LLM fingerprinting for copyright auditing and benchmarks its robustness against 13 post-development techniques across 149 model instances
Model Theft nlp
The broad capabilities and substantial resources required to train Large Language Models (LLMs) make them valuable intellectual property, yet they remain vulnerable to copyright infringement, such as unauthorized use and model theft. LLM fingerprinting, a non-intrusive technique that compares the distinctive features (i.e., fingerprints) of LLMs to identify whether one LLM is derived from another, offers a promising solution to copyright auditing. However, its reliability remains uncertain due to the prevalence of diverse model modifications and the lack of standardized evaluation. In this SoK, we present the first comprehensive study of the emerging field of LLM fingerprinting. We introduce a unified framework and taxonomy that structures the field: white-box methods are classified by their feature source as static, forward-pass, or backward-pass fingerprinting, while black-box methods are distinguished by their query strategy as either untargeted or targeted. Furthermore, we propose LeaFBench, the first systematic benchmark for evaluating LLM fingerprinting under realistic deployment scenarios. Built upon 7 mainstream foundation models and comprising 149 distinct model instances, LeaFBench integrates 13 representative post-development techniques, spanning both parameter-altering methods (e.g., fine-tuning, quantization) and parameter-independent techniques (e.g., system prompts, RAG). Extensive experiments on LeaFBench reveal the strengths and weaknesses of existing methods, thereby outlining future research directions and critical open problems in this emerging field. The code is available at https://github.com/shaoshuo-ss/LeaFBench.
llm transformer Zhejiang University · Nanyang Technological University · City University of Hong Kong +2 more
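The black-box, untargeted setting surveyed above reduces to querying the source and suspect models with a shared probe set and comparing their output features. The sketch below illustrates only that comparison loop; the probe set, the feature returned by `model_fn`, and the 0.9 decision threshold are assumptions for illustration, not LeaFBench's actual interface or metrics.

```python
import numpy as np

def fingerprint(model_fn, probes):
    """Stack one feature vector per probe (e.g., a next-token log-prob vector) into a matrix."""
    return np.stack([np.asarray(model_fn(p), dtype=float) for p in probes])

def is_derived(source_fn, suspect_fn, probes, threshold=0.9):
    """Flag the suspect as derived from the source when the mean per-probe cosine
    similarity exceeds a threshold calibrated against unrelated models."""
    f_src = fingerprint(source_fn, probes)
    f_sus = fingerprint(suspect_fn, probes)
    cos = np.sum(f_src * f_sus, axis=1) / (
        np.linalg.norm(f_src, axis=1) * np.linalg.norm(f_sus, axis=1) + 1e-12
    )
    score = float(cos.mean())
    return score >= threshold, score
```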
defense arXiv Oct 8, 2025 · Oct 2025
Shuo Shao, Yiming Li, Hongwei Yao et al. · Zhejiang University · Nanyang Technological University +1 more
Fingerprints LLMs in black-box settings via zeroth-order Jacobian estimation to detect stolen or illicitly copied models
Model Theft nlp
The substantial investment required to develop Large Language Models (LLMs) makes them valuable intellectual property, raising significant concerns about copyright protection. LLM fingerprinting has emerged as a key technique to address this: it aims to verify a model's origin by extracting an intrinsic, unique signature (a "fingerprint") and comparing it to that of a source model to identify illicit copies. However, existing black-box fingerprinting methods often fail to generate distinctive LLM fingerprints. This ineffectiveness arises because black-box methods typically rely on model outputs, which lose critical information about the model's unique parameters due to the use of non-linear functions. To address this, we first leverage Fisher Information Theory to formally demonstrate that the gradient with respect to the model's input is a more informative feature for fingerprinting than the output. Based on this insight, we propose ZeroPrint, a novel method that approximates these information-rich gradients in a black-box setting using zeroth-order estimation. ZeroPrint overcomes the challenge of applying this to discrete text by simulating input perturbations via semantic-preserving word substitutions. This operation allows ZeroPrint to estimate the model's Jacobian matrix as a unique fingerprint. Experiments on the standard benchmark show that ZeroPrint achieves state-of-the-art effectiveness and robustness, significantly outperforming existing black-box methods.
llm transformer Zhejiang University · Nanyang Technological University · City University of Hong Kong
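As the abstract describes, ZeroPrint simulates input perturbations with semantic-preserving word substitutions and reads the resulting output changes as a zeroth-order estimate of the Jacobian. The following is a minimal sketch of that idea under assumed helpers (`query_model`, `embed`, and the variant list are hypothetical), not the paper's released implementation.

```python
import numpy as np

def zeroth_order_fingerprint(query_model, embed, prompt, variants):
    """Estimate a Jacobian-like sensitivity matrix from model outputs alone.

    variants: semantically-preserving rewrites of `prompt` (e.g., single-word
    synonym substitutions) that stand in for small input perturbations."""
    base = np.asarray(embed(query_model(prompt)), dtype=float)
    rows = [np.asarray(embed(query_model(v)), dtype=float) - base for v in variants]
    jac = np.stack(rows)                          # one row per perturbation direction
    return jac / (np.linalg.norm(jac) + 1e-12)    # scale-normalize for cross-model comparison

def fingerprint_similarity(jac_a, jac_b):
    """Compare two normalized sensitivity matrices (e.g., source vs. suspect)."""
    return float(np.sum(jac_a * jac_b))
```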
defense arXiv Aug 13, 2025 · Aug 2025
Zhifan Luo, Shuo Shao, Su Zhang et al. · Zhejiang University · Huawei +1 more
Adversaries reconstruct private user prompts from LLM KV-cache via inversion, collision, and injection attacks; KV-Cloak defends with reversible matrix obfuscation
Model Inversion Attack Sensitive Information Disclosure nlp
The Key-Value (KV) cache, which stores intermediate attention computations (Key and Value pairs) to avoid redundant calculations, is a fundamental mechanism for accelerating Large Language Model (LLM) inference. However, this efficiency optimization introduces significant yet underexplored privacy risks. This paper provides the first comprehensive analysis of these vulnerabilities, demonstrating that an attacker can reconstruct sensitive user inputs directly from the KV-cache. We design and implement three distinct attack vectors: a direct Inversion Attack, a more broadly applicable and potent Collision Attack, and a semantic-based Injection Attack. These methods demonstrate the practicality and severity of KV-cache privacy leakage issues. To mitigate this, we propose KV-Cloak, a novel, lightweight, and efficient defense mechanism. KV-Cloak uses a reversible matrix-based obfuscation scheme, combined with operator fusion, to secure the KV-cache. Our extensive experiments show that KV-Cloak effectively thwarts all proposed attacks, reducing reconstruction quality to random noise. Crucially, it achieves this robust security with virtually no degradation in model accuracy and minimal performance overhead, offering a practical solution for trustworthy LLM deployment.
llm transformer Zhejiang University · Huawei · Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
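KV-Cloak's core idea, per the abstract, is that the cache can be stored in an obfuscated form that the serving stack can still undo exactly before attention. Below is a toy round trip with a single secret invertible matrix; the real scheme additionally uses operator fusion and a more careful construction, so treat every shape and name here as an assumption.

```python
import torch

HEAD_DIM = 64
# Secret invertible matrix held by the serving stack; it is never stored with the cache.
A = torch.randn(HEAD_DIM, HEAD_DIM, dtype=torch.float64)
A_INV = torch.linalg.inv(A)

def obfuscate_kv(k, v):
    """Write only obfuscated tensors into the cache; a raw-cache reader sees K @ A and V @ A."""
    return k @ A, v @ A

def recover_kv(k_obf, v_obf):
    """Invert the obfuscation immediately before attention, restoring the exact K/V."""
    return k_obf @ A_INV, v_obf @ A_INV

# Round-trip check on toy (batch, seq, head_dim) tensors.
k = torch.randn(1, 8, HEAD_DIM, dtype=torch.float64)
v = torch.randn(1, 8, HEAD_DIM, dtype=torch.float64)
k_rec, v_rec = recover_kv(*obfuscate_kv(k, v))
assert torch.allclose(k, k_rec, atol=1e-8) and torch.allclose(v, v_rec, atol=1e-8)
```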
attack arXiv Aug 18, 2025 · Aug 2025
Weiwei Qi, Shuo Shao, Wei Gu et al. · Zhejiang University · Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security +1 more
Markov-chain jailbreak framework combines diverse disguise strategies adaptively, achieving 90%+ ASR on GPT-4o in under 15 queries
Prompt Injection nlp
Large Language Models (LLMs) have exhibited remarkable capabilities but remain vulnerable to jailbreaking attacks, which can elicit harmful content from the models by manipulating the input prompts. Existing black-box jailbreaking techniques primarily rely on static prompts crafted with a single, non-adaptive strategy, or employ rigid combinations of several underperforming attack methods, which limits their adaptability and generalization. To address these limitations, we propose MAJIC, a Markovian adaptive jailbreaking framework that attacks black-box LLMs by iteratively combining diverse, innovative disguise strategies. MAJIC first establishes a "Disguise Strategy Pool" by refining existing strategies and introducing several innovative approaches. To further improve attack performance and efficiency, MAJIC formulates the sequential selection and fusion of strategies in the pool as a Markov chain. Under this formulation, MAJIC initializes and employs a Markov matrix to guide the strategy composition, where transition probabilities between strategies are dynamically adapted based on attack outcomes, thereby enabling MAJIC to learn and discover effective attack pathways tailored to the target model. Our empirical results demonstrate that MAJIC significantly outperforms existing jailbreak methods on prominent models such as GPT-4o and Gemini-2.0-flash, achieving an attack success rate of over 90% with fewer than 15 queries per attempt on average.
llm Zhejiang University · Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security · Sun Yat-Sen University
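The Markovian scheduling described above can be pictured as a small transition matrix over disguise strategies whose rows are re-weighted by attack outcomes. The sketch below shows only that bookkeeping, with generic strategy names and an assumed update rule; it contains none of MAJIC's actual disguise content or prompts.

```python
import random

STRATEGIES = ["strategy_a", "strategy_b", "strategy_c", "strategy_d"]  # placeholders only
N = len(STRATEGIES)
T = [[1.0 / N] * N for _ in range(N)]   # Markov transition matrix, initialized uniformly

def next_strategy(current: int) -> int:
    """Sample the next strategy index from the current row of the transition matrix."""
    return random.choices(range(N), weights=T[current], k=1)[0]

def update(current: int, chosen: int, success: bool, lr: float = 0.2) -> None:
    """Reinforce transitions whose composed attempt succeeded, then renormalize the row."""
    if success:
        T[current][chosen] += lr
    else:
        T[current][chosen] *= (1.0 - lr)
    row_sum = sum(T[current])
    T[current] = [p / row_sum for p in T[current]]
```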
attack arXiv Mar 22, 2026 · Mar 2026
Haolun Zheng, Yu He, Tailun Chen et al. · Zhejiang University · Hangzhou HighTech Zone (Binjiang) Blockchain and Data Security Research Institute +1 more
Distribution-optimization jailbreak attack on T2I models bypasses safety filters on Stable Diffusion, achieving a 43% attack success rate
Input Manipulation Attack Prompt Injection vision generative multimodal
Text-to-image (T2I) models such as Stable Diffusion and DALLE remain susceptible to generating harmful or Not-Safe-For-Work (NSFW) content under jailbreak attacks despite deployed safety filters. Existing jailbreak attacks either rely on proxy-loss optimization instead of the true end-to-end objective, or depend on large-scale and costly RL-trained generators. Motivated by these limitations, we propose JANUS, a lightweight framework that formulates jailbreak as optimizing a structured prompt distribution under a black-box, end-to-end reward from the T2I system and its safety filters. JANUS replaces a high-capacity generator with a low-dimensional mixing policy over two semantically anchored prompt distributions, enabling efficient exploration while preserving the target semantics. On modern T2I models, we outperform state-of-the-art jailbreak methods, improving ASR-8 from 25.30% to 43.15% on Stable Diffusion 3.5 Large Turbo with consistently higher CLIP and NSFW scores. JANUS succeeds across both open-source and commercial models. These findings expose structural weaknesses in current T2I safety pipelines and motivate stronger, distribution-aware defenses. Warning: This paper contains model outputs that may be offensive.
diffusion transformer multimodal Zhejiang University · Hangzhou HighTech Zone (Binjiang) Blockchain and Data Security Research Institute · Alibaba Group
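The "low-dimensional mixing policy" in the abstract can be read as a handful of parameters that decide how to blend two anchored prompt pools, tuned only from the black-box reward. A heavily simplified sketch with a single mixing weight follows; the pools, reward, and finite-difference update are placeholders, not the JANUS policy or its training procedure.

```python
import random

# Two semantically anchored prompt pools; the strings are neutral placeholders.
POOL_A = ["placeholder prompt A1", "placeholder prompt A2"]
POOL_B = ["placeholder prompt B1", "placeholder prompt B2"]

def sample_prompt(alpha: float) -> str:
    """Mix the two anchored distributions: draw from POOL_B with probability alpha."""
    return random.choice(POOL_B if random.random() < alpha else POOL_A)

def optimize_alpha(reward_fn, steps: int = 50, alpha: float = 0.5,
                   eps: float = 0.05, lr: float = 0.1) -> float:
    """Zeroth-order ascent on the single mixing parameter using only the
    end-to-end, black-box reward returned for each sampled prompt."""
    for _ in range(steps):
        r_plus = reward_fn(sample_prompt(min(alpha + eps, 1.0)))
        r_minus = reward_fn(sample_prompt(max(alpha - eps, 0.0)))
        alpha += lr * (r_plus - r_minus) / (2 * eps)   # finite-difference estimate
        alpha = min(max(alpha, 0.0), 1.0)
    return alpha
```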
defense arXiv Mar 11, 2026 · Mar 2026
Yu He, Haozhe Zhu, Yiming Li et al. · Zhejiang University · Nanyang Technological University +1 more
Runtime defense for LLM agents detecting indirect prompt injection via causal counterfactual analysis of tool invocations
Prompt Injection nlp
LLM agents are highly vulnerable to Indirect Prompt Injection (IPI), where adversaries embed malicious directives in untrusted tool outputs to hijack execution. Most existing defenses treat IPI as an input-level semantic discrimination problem, which often fails to generalize to unseen payloads. We propose a new paradigm, action-level causal attribution, which secures agents by asking why a particular tool call is produced. The central goal is to distinguish tool calls supported by the user's intent from those causally driven by untrusted observations. We instantiate this paradigm with AttriGuard, a runtime defense based on parallel counterfactual tests. For each proposed tool call, AttriGuard verifies its necessity by re-executing the agent under a control-attenuated view of external observations. Technically, AttriGuard combines teacher-forced shadow replay to prevent attribution confounding, hierarchical control attenuation to suppress diverse control channels while preserving task-relevant information, and a fuzzy survival criterion that is robust to LLM stochasticity. Across four LLMs and two agent benchmarks, AttriGuard achieves 0% ASR under static attacks with negligible utility loss and moderate overhead. Importantly, it remains resilient under adaptive optimization-based attacks in settings where leading defenses degrade significantly.
llm Zhejiang University · Nanyang Technological University · City University of Hong Kong
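AttriGuard's counterfactual test asks whether a proposed tool call still appears once the untrusted observation is attenuated. A minimal sketch of that decision is below; `agent_replay`, `attenuate`, and the string-level fuzzy match are assumed stand-ins for the paper's teacher-forced shadow replay, hierarchical attenuation, and survival criterion.

```python
from difflib import SequenceMatcher

def fuzzy_match(call_a: str, call_b: str, threshold: float = 0.8) -> bool:
    """Fuzzy survival criterion: tolerate LLM stochasticity when comparing serialized tool calls."""
    return SequenceMatcher(None, call_a, call_b).ratio() >= threshold

def is_injected(proposed_call: str, history, observation,
                agent_replay, attenuate, trials: int = 3) -> bool:
    """Replay the agent under a control-attenuated view of the untrusted observation.
    If the proposed call never survives, it was causally driven by the observation
    rather than the user's intent, so it should be blocked."""
    survived = 0
    for _ in range(trials):                              # independent counterfactual replays
        replayed = agent_replay(history, attenuate(observation))
        if replayed and fuzzy_match(proposed_call, replayed):
            survived += 1
    return survived == 0
```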
defense arXiv Sep 3, 2025 · Sep 2025
Yuchen Yang, Yiming Li, Hongwei Yao et al. · Zhejiang University · Nanyang Technological University +2 more
Watermarks LLM system prompts with content-only verification to detect prompt theft without requiring access to model logits
Model Theft Sensitive Information Disclosure nlp
System prompts are critical for shaping the behavior and output quality of large language model (LLM)-based applications, driving substantial investment in optimizing high-quality prompts beyond traditional handcrafted designs. However, as system prompts become valuable intellectual property, they are increasingly vulnerable to prompt theft and unauthorized use, highlighting the urgent need for effective copyright auditing, especially watermarking. Existing methods rely on verifying subtle logit distribution shifts triggered by a query. We observe that this logit-dependent verification framework is impractical in real-world content-only settings, primarily because (1) random sampling makes content-level generation unstable for verification, and (2) stronger instructions needed for content-level signals compromise prompt fidelity. To overcome these challenges, we propose PromptCOS, the first content-only system prompt copyright auditing method based on content-level output similarity. PromptCOS achieves watermark stability by designing a cyclic output signal as the conditional instruction's target. It preserves prompt fidelity by injecting a small set of auxiliary tokens to encode the watermark, leaving the main prompt untouched. Furthermore, to ensure robustness against malicious removal, we optimize cover tokens, i.e., critical tokens in the original prompt, to ensure that removing auxiliary tokens causes severe performance degradation. Experimental results show that PromptCOS achieves high effectiveness (99.3% average watermark similarity), strong distinctiveness (60.8% higher than the best baseline), high fidelity (accuracy degradation no greater than 0.6%), robustness (resilience against four potential attack categories), and high computational efficiency (up to 98.1% cost saving). Our code is available at GitHub (https://github.com/LianPing-cyber/PromptCOS).
llm transformer Zhejiang University · Nanyang Technological University · CRRC Zhuzhou Institute +1 more
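Content-only auditing, as described above, means the verifier sees nothing but generated text, so the check reduces to comparing sampled outputs on a trigger query against the expected cyclic signal. The sketch below illustrates that verification step only; the trigger query, signal, and threshold are hypothetical, whereas in PromptCOS they are optimized jointly with the auxiliary watermark tokens.

```python
from difflib import SequenceMatcher

TRIGGER_QUERY = "hypothetical verification query"          # placeholder
EXPECTED_SIGNAL = "alpha beta gamma alpha beta gamma"      # placeholder cyclic signal

def content_similarity(text: str, signal: str) -> float:
    """Content-level similarity between generated text and the expected signal (no logits needed)."""
    return SequenceMatcher(None, text.lower(), signal.lower()).ratio()

def audit(serve_fn, threshold: float = 0.9, samples: int = 5) -> bool:
    """Query the suspect application several times (random sampling makes single
    generations unstable) and claim the watermark if the average similarity is high."""
    scores = [content_similarity(serve_fn(TRIGGER_QUERY), EXPECTED_SIGNAL)
              for _ in range(samples)]
    return sum(scores) / len(scores) >= threshold
```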