Nenghai Yu

defense arXiv Oct 25, 2025

T2SMark: Balancing Robustness and Diversity in Noise-as-Watermark for Diffusion Models

Jindong Yang, Han Fang, Weiming Zhang et al. · University of Science and Technology of China · Anhui Province Key Laboratory of Digital Security +1 more

Proposes Tail-Truncated Sampling watermarking for diffusion model outputs, balancing robustness and generation diversity

Output Integrity Attack vision generative

5 citations · 2 influential

defense CCS Oct 5, 2025

SafeGuider: Robust and Practical Content Safety Control for Text-to-Image Models

Peigui Qi, Kunsheng Tang, Wenbo Zhou et al. · University of Science and Technology of China · Nanyang Technological University +1 more

Defends text-to-image models against adversarial prompt evasion attacks using EOS-token embedding detection and safety-aware feature erasure

Input Manipulation Attack vision nlp generative

1 citation

Text-to-image models have shown remarkable capabilities in generating high-quality images from natural language descriptions. However, these models are highly vulnerable to adversarial prompts, which can bypass safety measures and produce harmful content. Despite various defensive strategies, achieving robustness against attacks while maintaining practical utility in real-world applications remains a significant challenge. To address this issue, we first conduct an empirical study of the text encoder in the Stable Diffusion (SD) model, which is a widely used and representative text-to-image model. Our findings reveal that the [EOS] token acts as a semantic aggregator, exhibiting distinct distributional patterns between benign and adversarial prompts in its embedding space. Building on this insight, we introduce SafeGuider, a two-step framework designed for robust safety control without compromising generation quality. SafeGuider combines an embedding-level recognition model with a safety-aware feature erasure beam search algorithm. This integration enables the framework to maintain high-quality image generation for benign prompts while ensuring robust defense against both in-domain and out-of-domain attacks. SafeGuider demonstrates exceptional effectiveness in minimizing attack success rates, achieving a maximum rate of only 5.48% across various attack scenarios. Moreover, instead of refusing to generate or producing black images for unsafe prompts, SafeGuider generates safe and meaningful images, enhancing its practical utility. In addition, SafeGuider is not limited to the SD model and can be effectively applied to other text-to-image models, such as the Flux model, demonstrating its versatility and adaptability across different architectures. We hope that SafeGuider can shed some light on the practical deployment of secure text-to-image systems.

diffusion transformer University of Science and Technology of China · Nanyang Technological University · A*STAR


attack arXiv Sep 21, 2025

Multimodal Prompt Decoupling Attack on the Safety Filters in Text-to-Image Models

Xingkai Peng, Jun Jiang, Meng Tong et al. · University of Science and Technology of China

Multimodal jailbreak attack on T2I safety filters by decoupling unsafe prompts into image-guided adversarial text components

Prompt Injection vision nlp multimodal generative

1 citation

defense arXiv Nov 10, 2025

LiteUpdate: A Lightweight Framework for Updating AI-Generated Image Detectors

Jiajie Lu, Zhenkan Fu, Na Zhao et al. · University of Science and Technology of China · Shanghai AI Laboratory

Proposes LiteUpdate to efficiently update AI-generated image detectors against new generators while preventing catastrophic forgetting

Output Integrity Attack vision generative

1 citation

defense arXiv Sep 26, 2025

PSRT: Accelerating LRM-based Guard Models via Prefilled Safe Reasoning Traces

Jiawei Zhao, Yuang Qi, Weiming Zhang et al. · University of Science and Technology of China

Efficient LRM guard model replaces slow reasoning traces with prefilled tokens to detect jailbreaks in one forward pass

Prompt Injection nlp


defense arXiv Jan 28, 2026

SemBind: Binding Diffusion Watermarks to Semantics Against Black-Box Forgery Attacks

Xin Zhang, Zijin Yang, Kejiang Chen et al. · University of Science and Technology of China

Defends diffusion model image watermarks from black-box forgery by semantically binding latent signals via contrastive learning

Output Integrity Attack vision generative


benchmark arXiv Jan 30, 2026

Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures

Yanghao Su, Wenbo Zhou, Tianwei Zhang et al. · University of Science and Technology of China · Nanyang Technological University +2 more

Mechanistic study showing character-disposition fine-tuning creates stronger, transferable LLM misalignment unifying backdoor triggers and jailbreak susceptibility

Model Poisoning Prompt Injection nlp


benchmark arXiv Jan 29, 2026

WMVLM: Evaluating Diffusion Model Image Watermarking via Vision-Language Models

Zijin Yang, Yu Sun, Kejiang Chen et al. · University of Science and Technology of China · Anhui Province Key Laboratory of Digital Security +1 more

Proposes a unified VLM-based benchmark for evaluating residual and semantic watermarks in diffusion model image outputs

Output Integrity Attack vision generative


defense arXiv Oct 18, 2025

EditMark: Watermarking Large Language Models based on Model Editing

Shuai Li, Kejiang Chen, Jun Jiang et al. · University of Science and Technology of China · A*STAR +1 more

Embeds 32-bit ownership watermarks into LLM weights via model editing in 20 seconds, enabling copyright verification without training costs

Model Theft nlp


Papers in Database (9)