attack arXiv Feb 5, 2026
Xin Chen, Jie Zhang, Florian Tramèr · ETH Zürich
RL-trained 1.5B model generates universal, transferable prompt injection suffixes that compromise GPT, Claude, and Gemini agents
Prompt Injection nlp
Prompt injection is one of the most critical vulnerabilities in LLM agents, yet effective automated attacks remain largely unexplored from an optimization perspective. Existing methods depend heavily on human red-teamers and hand-crafted prompts, limiting their scalability and adaptability. We propose AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes while jointly optimizing for attack success and utility preservation on benign tasks. Our black-box method supports both query-based optimization and transfer attacks to unseen models and tasks. Using only a 1.5B-parameter adversarial suffix generator, we successfully compromise frontier systems including GPT 5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash on the AgentDojo benchmark, establishing a stronger baseline for automated prompt injection research.
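The joint objective described above can be sketched as a reward-shaping rule: the RL signal rewards attack success while penalizing any degradation of benign-task utility. This is a minimal illustrative sketch; the function name, weighting scheme, and default coefficient are assumptions, not AutoInject's actual implementation.

```python
def joint_reward(attack_success: float, benign_utility: float,
                 baseline_utility: float, lam: float = 0.5) -> float:
    """Hypothetical reward balancing attack success against utility loss.

    attack_success:   fraction of injected goals achieved (0..1)
    benign_utility:   agent's benign-task score with the suffix appended
    baseline_utility: agent's benign-task score without any suffix
    lam:              assumed trade-off weight (not from the paper)
    """
    # Penalize only degradation, not improvement, of benign utility.
    utility_drop = max(0.0, baseline_utility - benign_utility)
    return attack_success - lam * utility_drop
```

Under this shaping, a suffix that hijacks the agent without visibly breaking benign behavior scores highest, which matches the paper's stated goal of jointly optimizing attack success and utility preservation.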
llm
defense arXiv Oct 18, 2025
Xinfeng Li, Shengyuan Pang, Jialin Wu et al. · Nanyang Technological University · Zhejiang University · ETH Zürich
Defends text-to-image diffusion models against white-box fine-tuning attacks via non-fine-tunable safety alignment and feature-level input moderation
Transfer Learning Attack vision generative
Text-to-image (T2I) models, though exhibiting remarkable creativity in image generation, can be exploited to produce unsafe images. Existing safety measures, e.g., content moderation or model alignment, fail against white-box adversaries who know, and can adjust, model parameters, e.g., by fine-tuning. This paper presents a novel defensive framework, named Patronus, which equips T2I models with holistic protection against white-box adversaries. Specifically, we design an internal moderator that decodes unsafe input features into zero vectors while preserving the decoding performance of benign input features. Furthermore, we strengthen model alignment with a carefully designed non-fine-tunable learning mechanism, ensuring the T2I model cannot be compromised by malicious fine-tuning. Extensive experiments validate that performance on safe content generation remains intact and that unsafe content generation is effectively rejected. Results also confirm the resilience of Patronus against various fine-tuning attacks by white-box adversaries.
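The internal-moderator idea above can be sketched as a feature-level gate: unsafe input features are decoded to the zero vector, so no unsafe signal reaches generation, while benign features pass through unchanged. This is a minimal sketch under assumed names; `unsafe_score`, the threshold, and the list-of-floats feature representation are illustrative stand-ins, not Patronus's actual API.

```python
from typing import Callable, List

def moderate_features(features: List[float],
                      unsafe_score: Callable[[List[float]], float],
                      threshold: float = 0.5) -> List[float]:
    """Hypothetical feature-level moderator.

    If the (assumed) unsafe_score classifier flags the input features,
    decode them to the zero vector so downstream generation receives
    no unsafe conditioning signal; otherwise pass features through
    untouched to preserve benign decoding performance.
    """
    if unsafe_score(features) > threshold:
        return [0.0] * len(features)  # unsafe input -> zero vector
    return features                   # benign input -> unchanged
```

Operating at the feature level, rather than on raw prompts or output images, is what lets this moderation sit inside the model, where it can be paired with the non-fine-tunable alignment mechanism rather than stripped away like an external filter.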
diffusion transformer