attack arXiv Sep 21, 2025 · Sep 2025
Xingkai Peng, Jun Jiang, Meng Tong et al. · University of Science and Technology of China
Multimodal jailbreak attack on T2I safety filters by decoupling unsafe prompts into image-guided adversarial text components
Prompt Injection visionnlpmultimodalgenerative
Text-to-image (T2I) models have been widely applied in generating high-fidelity images across various domains. However, these models may also be abused to produce Not-Safe-for-Work (NSFW) content via jailbreak attacks. Existing jailbreak methods primarily manipulate the textual prompt, leaving potential vulnerabilities in image-based inputs largely unexplored. Moreover, text-based methods face challenges in bypassing the model's safety filters. In response to these limitations, we propose the Multimodal Prompt Decoupling Attack (MPDA), which utilizes image modality to separate the harmful semantic components of the original unsafe prompt. MPDA follows three core steps: firstly, a large language model (LLM) decouples unsafe prompts into pseudo-safe prompts and harmful prompts. The former are seemingly harmless sub-prompts that can bypass filters, while the latter are sub-prompts with unsafe semantics that trigger filters. Subsequently, the LLM rewrites the harmful prompts into natural adversarial prompts to bypass safety filters, which guide the T2I model to modify the base image into an NSFW output. Finally, to ensure semantic consistency between the generated NSFW images and the original unsafe prompts, the visual language model generates image captions, providing a new pathway to guide the LLM in iterative rewriting and refining the generated content.
llm vlm diffusion multimodal University of Science and Technology of China
defense arXiv Oct 18, 2025 · Oct 2025
Shuai Li, Kejiang Chen, Jun Jiang et al. · University of Science and Technology of China · A*STAR +1 more
Embeds 32-bit ownership watermarks into LLM weights via model editing in 20 seconds, enabling copyright verification without training costs
Model Theft Model Theft nlp
Large Language Models (LLMs) have demonstrated remarkable capabilities, but their training requires extensive data and computational resources, rendering them valuable digital assets. Therefore, it is essential to watermark LLMs to protect their copyright and trace unauthorized use or resale. Existing methods for watermarking LLMs primarily rely on training LLMs with a watermarked dataset, which entails burdensome training costs and negatively impacts the LLM's performance. In addition, their watermarked texts are not logical or natural, thereby reducing the stealthiness of the watermark. To address these issues, we propose EditMark, the first watermarking method that leverages model editing to embed a training-free, stealthy, and performance-lossless watermark for LLMs. We observe that some questions have multiple correct answers. Therefore, we assign each answer a unique watermark and update the weights of LLMs to generate corresponding questions and answers through the model editing technique. In addition, we refine the model editing technique to align with the requirements of watermark embedding. Specifically, we introduce an adaptive multi-round stable editing strategy, coupled with the injection of a noise matrix, to improve both the effectiveness and robustness of the watermark embedding. Extensive experiments indicate that EditMark can embed 32-bit watermarks into LLMs within 20 seconds (Fine-tuning: 6875 seconds) with a watermark extraction success rate of 100%, which demonstrates its effectiveness and efficiency. External experiments further demonstrate that EditMark has fidelity, stealthiness, and a certain degree of robustness against common attacks.
llm transformer University of Science and Technology of China · A*STAR · University of Siena
attack arXiv Oct 7, 2025 · Oct 2025
Meng Tong, Yuntao Du, Kejiang Chen et al. · University of Science and Technology of China · Purdue University
Exploits LLM tokenizers as a new membership inference attack vector, achieving AUC 0.771 against state-of-the-art LLM tokenizers
Membership Inference Attack nlp
Membership inference attacks (MIAs) are widely used to assess the privacy risks associated with machine learning models. However, when these attacks are applied to pre-trained large language models (LLMs), they encounter significant challenges, including mislabeled samples, distribution shifts, and discrepancies in model size between experimental and real-world settings. To address these limitations, we introduce tokenizers as a new attack vector for membership inference. Specifically, a tokenizer converts raw text into tokens for LLMs. Unlike full models, tokenizers can be efficiently trained from scratch, thereby avoiding the aforementioned challenges. In addition, the tokenizer's training data is typically representative of the data used to pre-train LLMs. Despite these advantages, the potential of tokenizers as an attack vector remains unexplored. To this end, we present the first study on membership leakage through tokenizers and explore five attack methods to infer dataset membership. Extensive experiments on millions of Internet samples reveal the vulnerabilities in the tokenizers of state-of-the-art LLMs. To mitigate this emerging risk, we further propose an adaptive defense. Our findings highlight tokenizers as an overlooked yet critical privacy threat, underscoring the urgent need for privacy-preserving mechanisms specifically designed for them.
llm University of Science and Technology of China · Purdue University
defense arXiv Jan 28, 2026 · 9w ago
Xin Zhang, Zijin Yang, Kejiang Chen et al. · University of Science and Technology of China
Defends diffusion model image watermarks from black-box forgery by semantically binding latent signals via contrastive learning
Output Integrity Attack visiongenerative
Latent-based watermarks, integrated into the generation process of latent diffusion models (LDMs), simplify detection and attribution of generated images. However, recent black-box forgery attacks, where an attacker needs at least one watermarked image and black-box access to the provider's model, can embed the provider's watermark into images not produced by the provider, posing outsized risk to provenance and trust. We propose SemBind, the first defense framework for latent-based watermarks that resists black-box forgery by binding latent signals to image semantics via a learned semantic masker. Trained with contrastive learning, the masker yields near-invariant codes for the same prompt and near-orthogonal codes across prompts; these codes are reshaped and permuted to modulate the target latent before any standard latent-based watermark. SemBind is generally compatible with existing latent-based watermarking schemes and keeps image quality essentially unchanged, while a simple mask-ratio parameter offers a tunable trade-off between anti-forgery strength and robustness. Across four mainstream latent-based watermark methods, our SemBind-enabled anti-forgery variants markedly reduce false acceptance under black-box forgery while providing a controllable robustness-security balance.
diffusion University of Science and Technology of China
benchmark arXiv Jan 29, 2026 · 9w ago
Zijin Yang, Yu Sun, Kejiang Chen et al. · University of Science and Technology of China · Anhui Province Key Laboratory of Digital Security +1 more
Proposes a unified VLM-based benchmark for evaluating residual and semantic watermarks in diffusion model image outputs
Output Integrity Attack visiongenerative
Digital watermarking is essential for securing generated images from diffusion models. Accurate watermark evaluation is critical for algorithm development, yet existing methods have significant limitations: they lack a unified framework for both residual and semantic watermarks, provide results without interpretability, neglect comprehensive security considerations, and often use inappropriate metrics for semantic watermarks. To address these gaps, we propose WMVLM, the first unified and interpretable evaluation framework for diffusion model image watermarking via vision-language models (VLMs). We redefine quality and security metrics for each watermark type: residual watermarks are evaluated by artifact strength and erasure resistance, while semantic watermarks are assessed through latent distribution shifts. Moreover, we introduce a three-stage training strategy to progressively enable the model to achieve classification, scoring, and interpretable text generation. Experiments show WMVLM outperforms state-of-the-art VLMs with strong generalization across datasets, diffusion models, and watermarking methods.
vlm diffusion University of Science and Technology of China · Anhui Province Key Laboratory of Digital Security · National University of Singapore
defense arXiv Sep 26, 2025 · Sep 2025
Jiawei Zhao, Yuang Qi, Weiming Zhang et al. · University of Science and Technology of China
Efficient LRM guard model replaces slow reasoning traces with prefilled tokens to detect jailbreaks in one forward pass
Prompt Injection nlp
Large Reasoning Models (LRMs) have demonstrated remarkable performance on tasks such as mathematics and code generation. Motivated by these strengths, recent work has empirically demonstrated the effectiveness of LRMs as guard models in improving harmful query detection. However, LRMs typically generate long reasoning traces during inference, causing substantial computational overhead. In this paper, we introduce PSRT, a method that replaces the model's reasoning process with a Prefilled Safe Reasoning Trace, thereby significantly reducing the inference cost of LRMs. Concretely, PSRT prefills "safe reasoning virtual tokens" from a constructed dataset and learns over their continuous embeddings. With the aid of indicator tokens, PSRT enables harmful-query detection in a single forward pass while preserving the classification effectiveness of LRMs. We evaluate PSRT on 7 models, 13 datasets, and 8 jailbreak methods. In terms of efficiency, PSRT completely removes the overhead of generating reasoning tokens during inference. In terms of classification performance, PSRT achieves nearly identical accuracy, with only a minor average F1 drop of 0.015 across 7 models and 5 datasets.
llm transformer University of Science and Technology of China