Jiaheng Zhang

defense arXiv Sep 29, 2025 · Sep 2025

DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models

Zherui Li, Zheng Nie, Zhenhong Zhou et al. · Beijing University of Posts and Telecommunications · National University of Singapore +5 more

Defends diffusion LLMs against jailbreaks by fixing greedy remasking bias and block-level autonomous safety repair

Prompt Injection nlp

3 citations · 2 influential

defense arXiv Nov 5, 2025 · Nov 2025

SWAP: Towards Copyright Auditing of Soft Prompts via Sequential Watermarking

Wenyuan Yang, Yichen Sun, Changzheng Chen et al. · Sun Yat-Sen University · Zhejiang University +2 more

Watermarks CLIP soft prompts via sequential OOD class ordering to detect if third-party models stole protected prompts

Model Theft vision multimodal

Large-scale vision-language models, especially CLIP, have demonstrated remarkable performance across diverse downstream tasks. Soft prompts, as carefully crafted modules that efficiently adapt vision-language models to specific tasks, necessitate effective copyright protection. In this paper, we investigate model copyright protection by auditing whether suspicious third-party models incorporate protected soft prompts. While this can be viewed as a special case of model ownership auditing, our analysis shows that existing techniques are ineffective due to prompt learning's unique characteristics. Non-intrusive auditing is inherently prone to false positives when independent models share similar data distributions with victim models. Intrusive approaches also fail: backdoor methods designed for CLIP cannot embed functional triggers, while extending traditional DNN backdoor techniques to prompt learning suffers from harmfulness and ambiguity challenges. We find that these failures in intrusive auditing stem from the same fundamental reason: watermarking operates within the same decision space as the primary task yet pursues opposing objectives. Motivated by these findings, we propose sequential watermarking for soft prompts (SWAP), which implants watermarks into a different and more complex space. SWAP encodes watermarks through a specific order of defender-specified out-of-distribution classes, inspired by the zero-shot prediction capability of CLIP. This watermark, which is embedded in a more complex space, keeps the original prediction label unchanged, making it less opposed to the primary task. We further design a hypothesis-test-guided verification protocol for SWAP and provide theoretical analyses of success conditions. Extensive experiments on 11 datasets demonstrate SWAP's effectiveness, harmlessness, and robustness against potential adaptive attacks.
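The order-based verification idea in the abstract can be illustrated with a toy sketch. This is not the authors' protocol: the function names, the uniform-ordering null hypothesis (an independent model matches a given order of k classes with probability 1/k!), and the threshold `alpha` are all assumptions made here for illustration.

```python
import math
from typing import Sequence, Tuple


def matches_order(scores: Sequence[float], secret_order: Sequence[int]) -> bool:
    """Return True if the scores strictly decrease along secret_order.

    scores[i] is the (hypothetical) model score for OOD class i on one query.
    """
    ranked = [scores[i] for i in secret_order]
    return all(a > b for a, b in zip(ranked, ranked[1:]))


def binomial_tail(n: int, k: int, p: float) -> float:
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1)
    )


def verify_watermark(
    all_scores: Sequence[Sequence[float]],
    secret_order: Sequence[int],
    alpha: float = 0.01,
) -> Tuple[int, float, bool]:
    """One-sided binomial test on how often the secret order appears.

    Assumption: under the null, every ordering of the k classes is equally
    likely, so a single query matches with probability 1/k!.
    """
    n = len(all_scores)
    hits = sum(matches_order(s, secret_order) for s in all_scores)
    p_null = 1.0 / math.factorial(len(secret_order))
    p_value = binomial_tail(n, hits, p_null)
    return hits, p_value, p_value < alpha
```

In this sketch, a suspicious model is flagged only when the secret order shows up far more often than chance would allow; the primary-task prediction is never consulted, which mirrors the abstract's point that the watermark lives outside the main decision space.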

vlm transformer Sun Yat-Sen University · Zhejiang University · National University of Singapore +1 more

defense arXiv Jan 1, 2026 · Jan 2026

Making Theft Useless: Adulteration-Based Protection of Proprietary Knowledge Graphs in GraphRAG Systems

Weijie Wang, Peizhuo Lv, Yan Wang et al. · Chinese Academy of Sciences · National University of Singapore +2 more

Injects false 'adulterant' facts into proprietary Knowledge Graphs to render stolen copies unusable in competing GraphRAG deployments

Model Theft nlp graph

defense arXiv Feb 3, 2026 · Feb 2026

GuardReasoner-Omni: A Reasoning-based Multi-modal Guardrail for Text, Image, and Video

Zhenhao Zhu, Yue Liu, Yanpei Guo et al. · Tsinghua University · National University of Singapore +2 more

Reasoning-based omni-modal guardrail using SFT+GRPO to detect harmful text, image, and video LLM outputs

Prompt Injection multimodal nlp vision
