defense · arXiv · Dec 9, 2025
Yinan Zhong, Qianhao Miao, Yanjiao Chen et al. · Zhejiang University
Attention-based defense framework detects and sanitizes indirect prompt injection attacks in LLMs at the token level
Prompt Injection · nlp
Large Language Models (LLMs) have been integrated into many applications (e.g., web agents) to perform increasingly sophisticated tasks. However, LLM-empowered applications are vulnerable to Indirect Prompt Injection (IPI) attacks, where instructions are injected via untrustworthy external data sources. This paper presents Rennervate, a defense framework that detects and prevents IPI attacks. Rennervate leverages attention features to detect covert injections at a fine-grained token level, enabling precise sanitization that neutralizes IPI attacks while preserving LLM functionality. Specifically, the token-level detector is realized with a 2-step attentive pooling mechanism that aggregates attention heads and response tokens for IPI detection and sanitization. Moreover, we construct a fine-grained IPI dataset, FIPI, which will be open-sourced to support further research. Extensive experiments verify that Rennervate outperforms 15 commercial and academic IPI defense methods, achieving high precision across 5 LLMs and 6 datasets. We also demonstrate that Rennervate transfers to unseen attacks and is robust against adaptive adversaries. (A hedged code sketch of the pooling mechanism follows this entry.)
llm · transformer
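As a rough illustration of the mechanism the abstract describes, here is a minimal sketch of a 2-step attentive pooling detector, assuming the LLM's attention maps are exported as a (heads × response-tokens × input-tokens) tensor. All names, shapes, and the classifier head are assumptions for illustration, not Rennervate's actual implementation.

```python
# Hypothetical sketch of a 2-step attentive pooling detector (PyTorch).
# Shapes, names, and the classifier head are illustrative assumptions.
import torch
import torch.nn as nn

class TwoStepAttentivePooling(nn.Module):
    """Scores each input token for injection likelihood from attention maps.

    attn: [H, R, T] tensor of attention paid by each of R response tokens
    to each of T input tokens, stacked over H attention heads.
    Step 1 pools over heads; step 2 pools over response tokens, yielding
    one feature vector per input token for a lightweight classifier.
    """
    def __init__(self, n_heads: int, d_feat: int = 64):
        super().__init__()
        self.head_query = nn.Parameter(torch.randn(n_heads))   # step-1 weights
        self.resp_proj = nn.Linear(1, d_feat)                   # lift scalars to features
        self.resp_query = nn.Parameter(torch.randn(d_feat))     # step-2 weights
        self.classifier = nn.Linear(d_feat, 1)                  # per-token score

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        head_w = torch.softmax(self.head_query, dim=0)                  # [H]
        pooled = torch.einsum("h,hrt->rt", head_w, attn)                # [R, T]
        feats = self.resp_proj(pooled.unsqueeze(-1))                    # [R, T, d]
        resp_w = torch.softmax(feats @ self.resp_query, dim=0)          # [R, T], over R
        token_feats = (resp_w.unsqueeze(-1) * feats).sum(dim=0)         # [T, d]
        return torch.sigmoid(self.classifier(token_feats)).squeeze(-1)  # [T]

# Toy usage: flag and drop suspected injected tokens before re-prompting.
detector = TwoStepAttentivePooling(n_heads=32)
attn = torch.rand(32, 10, 20)        # 32 heads, 10 response tokens, 20 input tokens
scores = detector(attn)              # [20] injection scores in (0, 1)
keep_mask = scores < 0.5             # sanitize: keep only low-scoring tokens
```

In this sketch, sanitization simply drops input tokens whose injection score crosses a threshold; the paper's precise aggregation and sanitization rules may differ.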
defense · arXiv · Oct 18, 2025
Xinfeng Li, Shengyuan Pang, Jialin Wu et al. · Nanyang Technological University · Zhejiang University · ETH Zürich
Defends text-to-image diffusion models against white-box fine-tuning attacks via non-fine-tunable safety alignment and feature-level input moderation
Transfer Learning Attack · vision · generative
Text-to-image (T2I) models, though exhibiting remarkable creativity in image generation, can be exploited to produce unsafe images. Existing safety measures, e.g., content moderation or model alignment, fail in the presence of white-box adversaries who know and can adjust model parameters, e.g., by fine-tuning. This paper presents a novel defensive framework, named Patronus, which equips T2I models with holistic protection against white-box adversaries. Specifically, we design an internal moderator that decodes unsafe input features into zero vectors while preserving the decoding performance of benign input features. Furthermore, we strengthen model alignment with a carefully designed non-fine-tunable learning mechanism, ensuring the T2I model cannot be compromised by malicious fine-tuning. We conduct extensive experiments to validate that performance on safe content generation remains intact and that unsafe content generation is effectively rejected. Results also confirm the resilience of Patronus against various fine-tuning attacks by white-box adversaries. (A hedged code sketch of the moderator idea follows this entry.)
diffusion · transformer
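To make the moderator idea concrete, the following is a minimal sketch, assuming the moderator sits on top of frozen text-encoder features (e.g., CLIP-style embeddings) and is trained with a simple two-term objective: reconstruct benign features, suppress unsafe ones toward the zero vector. All names, shapes, and the loss are assumptions; Patronus's actual design may differ.

```python
# Hypothetical sketch of feature-level input moderation (PyTorch).
# Names, shapes, and the loss are illustrative assumptions.
import torch
import torch.nn as nn

class InternalModerator(nn.Module):
    """Re-encodes text-encoder features before they condition the diffusion
    model; trained so unsafe features decode to (near-)zero vectors while
    benign features pass through essentially unchanged."""
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats)

def moderation_loss(moderator, benign, unsafe):
    # Reconstruction term keeps benign conditioning intact;
    # suppression term pushes unsafe features toward the zero vector.
    recon = (moderator(benign) - benign).pow(2).mean()
    suppress = moderator(unsafe).pow(2).mean()
    return recon + suppress

# Toy training step over random stand-ins for CLIP-style text embeddings.
moderator = InternalModerator()
opt = torch.optim.Adam(moderator.parameters(), lr=1e-4)
benign = torch.randn(8, 77, 768)     # batch, tokens, feature dim
unsafe = torch.randn(8, 77, 768)
opt.zero_grad()
loss = moderation_loss(moderator, benign, unsafe)
loss.backward()
opt.step()
```

Zeroed-out conditioning degrades generation only for unsafe prompts; the non-fine-tunable alignment component, which protects the weights themselves against malicious fine-tuning, is a separate mechanism not sketched here.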