defense arXiv Dec 9, 2025
Yinan Zhong, Qianhao Miao, Yanjiao Chen et al. · Zhejiang University
Attention-based defense framework detects and sanitizes indirect prompt injection attacks in LLMs at token level
Prompt Injection · nlp
Large Language Models (LLMs) have been integrated into many applications (e.g., web agents) to perform increasingly sophisticated tasks. However, LLM-empowered applications are vulnerable to Indirect Prompt Injection (IPI) attacks, where instructions are injected via untrustworthy external data sources. This paper presents Rennervate, a defense framework that detects and prevents IPI attacks. Rennervate leverages attention features to detect covert injections at a fine-grained token level, enabling precise sanitization that neutralizes IPI attacks while preserving LLM functionality. Specifically, the token-level detector is realized as a 2-step attentive pooling mechanism that aggregates attention heads and response tokens for IPI detection and sanitization. Moreover, we establish FIPI, a fine-grained IPI dataset, which will be open-sourced to support further research. Extensive experiments verify that Rennervate outperforms 15 commercial and academic IPI defense methods, achieving high precision on 5 LLMs and 6 datasets. We also demonstrate that Rennervate transfers to unseen attacks and is robust against adaptive adversaries.
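The abstract names the 2-step attentive pooling mechanism but not its concrete architecture. As a rough illustration only, the PyTorch sketch below shows one plausible shape of such a detector: pool attention scores across heads first, then across response tokens, yielding a per-token injection score over the untrusted external data. The module names (AttentivePool, IPITokenDetector), the fixed response length, and the sigmoid scorer are assumptions, not Rennervate's published design.

```python
import torch
import torch.nn as nn

class AttentivePool(nn.Module):
    """Self-attentive pooling over the last axis: learn a weight per
    element, then return the weighted sum. (Hypothetical sketch.)"""

    def __init__(self, n: int, hidden: int = 32):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(n, hidden), nn.Tanh(), nn.Linear(hidden, n))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (..., n)
        w = torch.softmax(self.score(x), dim=-1)  # weights over the axis
        return (w * x).sum(dim=-1)                # pooled: (...,)

class IPITokenDetector(nn.Module):
    """2-step attentive pooling: step 1 aggregates attention heads,
    step 2 aggregates response tokens, producing one injection score
    per token of the external data."""

    def __init__(self, num_heads: int, num_resp_tokens: int):
        super().__init__()
        self.pool_heads = AttentivePool(num_heads)        # step 1
        self.pool_resp = AttentivePool(num_resp_tokens)   # step 2

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: (heads H, response tokens R, data tokens D), i.e. how
        # strongly each response token attends to each untrusted token.
        x = attn.permute(2, 1, 0)                 # (D, R, H)
        x = self.pool_heads(x)                    # (D, R): heads pooled
        return torch.sigmoid(self.pool_resp(x))  # (D,): per-token score

detector = IPITokenDetector(num_heads=32, num_resp_tokens=128)
scores = detector(torch.rand(32, 128, 512))  # 512 external-data tokens
to_sanitize = scores > 0.5                   # tokens to mask or drop
```

Thresholded tokens would then be masked or removed before the LLM re-processes the external data; the 0.5 cutoff is likewise only a placeholder.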
llm · transformer · Zhejiang University
defense arXiv Oct 18, 2025
Xinfeng Li, Shengyuan Pang, Jialin Wu et al. · Nanyang Technological University · Zhejiang University · ETH Zürich
Defends text-to-image diffusion models against white-box fine-tuning attacks via non-fine-tunable safety alignment and feature-level input moderation
Transfer Learning Attack · vision · generative
Text-to-image (T2I) models, though exhibiting remarkable creativity in image generation, can be exploited to produce unsafe images. Existing safety measures, e.g., content moderation or model alignment, fail in the presence of white-box adversaries who know and can adjust model parameters, e.g., by fine-tuning. This paper presents a novel defensive framework, named Patronus, which equips T2I models with holistic protection against white-box adversaries. Specifically, we design an internal moderator that decodes unsafe input features into zero vectors while preserving the decoding performance on benign input features. Furthermore, we strengthen model alignment with a carefully designed non-fine-tunable learning mechanism, ensuring the T2I model cannot be compromised by malicious fine-tuning. We conduct extensive experiments to validate that performance on safe content generation remains intact and that unsafe content generation is effectively rejected. The results also confirm the resilience of Patronus against various fine-tuning attacks by white-box adversaries.
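Again a sketch only: the abstract describes a feature-level moderator that maps unsafe input features to zero vectors while reconstructing benign ones. A minimal PyTorch reading, assuming a gated decoder between the text encoder and the diffusion backbone plus a two-term training loss, might look as follows; the module layout, the gate, and the loss are guesses, not Patronus's published design (and the non-fine-tunable alignment mechanism is not shown here).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InternalModerator(nn.Module):
    """Hypothetical feature-level moderator for a T2I pipeline: sits
    between the text encoder and the diffusion backbone. Benign
    features should pass through reconstructed; unsafe features
    should collapse to the zero vector."""

    def __init__(self, dim: int):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, dim) prompt embeddings from the text encoder
        return self.gate(feats) * self.decoder(feats)

def moderator_loss(mod: InternalModerator,
                   benign: torch.Tensor,
                   unsafe: torch.Tensor) -> torch.Tensor:
    keep = F.mse_loss(mod(benign), benign)  # reconstruct benign features
    drop = mod(unsafe).pow(2).mean()        # push unsafe features to zero
    return keep + drop

mod = InternalModerator(dim=768)
loss = moderator_loss(mod, torch.randn(8, 768), torch.randn(8, 768))
loss.backward()
```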
diffusion · transformer · Nanyang Technological University · Zhejiang University · ETH Zürich
attack arXiv Jan 20, 2026
Jiani Liu, Yixin He, Lanlan Fan et al. · Zhejiang University · Southeast University
Proposes PINA, a black-box prompt injection attack against LLM navigation agents achieving 87.5% average attack success rate
Prompt Injection · nlp
Navigation agents powered by large language models (LLMs) convert natural language instructions into executable plans and actions. Compared to text-based applications, their security is far more critical: a successful prompt injection attack does not just alter outputs but can directly misguide physical navigation, leading to unsafe routes, mission failure, or real-world harm. Despite this high-stakes setting, the vulnerability of navigation agents to prompt injection remains largely unexplored. In this paper, we propose PINA, an adaptive prompt optimization framework tailored to navigation agents under black-box, long-context, and action-executable constraints. Experiments on indoor and outdoor navigation agents show that PINA achieves an average attack success rate (ASR) of 87.5%, surpasses all baselines, and remains robust under ablations and adaptive conditions. This work provides the first systematic investigation of prompt injection attacks in navigation and highlights their urgent security implications for embodied LLM agents.
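The abstract states the constraints (black-box, long-context, action-executable) but not the optimizer itself. Purely as a structural sketch of the query-feedback loop that black-box prompt optimization implies, and useful mainly to defenders building robustness-evaluation harnesses: all three callables below (agent_query, score_fn, mutate_fn) are hypothetical placeholders, and nothing here reflects PINA's actual strategy.

```python
def black_box_prompt_search(agent_query, score_fn, mutate_fn,
                            seed_prompt: str, iters: int = 50):
    """Generic hill-climbing loop over candidate prompts using only
    black-box feedback. agent_query(prompt) runs the agent and returns
    its observable behavior; score_fn rates that behavior; mutate_fn
    proposes a variant. All three are hypothetical placeholders."""
    best = seed_prompt
    best_score = score_fn(agent_query(best))
    for _ in range(iters):
        cand = mutate_fn(best)            # propose a variant
        s = score_fn(agent_query(cand))   # black-box feedback only
        if s > best_score:                # keep the better candidate
            best, best_score = cand, s
    return best, best_score
```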
llm · Zhejiang University · Southeast University