attack arXiv Oct 27, 2025 · Oct 2025
Yuchong Xie, Zesen Liu, Mingyu Luo et al. · The Hong Kong University of Science and Technology · Fudan University +1 more
Query-agnostic indirect prompt injection on coding agents via optimized malicious tool descriptions, achieving 87% attack success rate
Prompt Injection Insecure Plugin Design nlp
Modern coding agents integrated into IDEs orchestrate powerful tools and high-privilege system access, creating a high-stakes attack surface. Prior work on Indirect Prompt Injection (IPI) is mainly query-specific, requiring particular user queries as triggers and leading to poor generalizability. We propose query-agnostic IPI, a new attack paradigm that reliably executes malicious payloads under arbitrary user queries. Our key insight is that malicious payloads should leverage the invariant prompt context (i.e., system prompt and tool descriptions) rather than variant user queries. We present QueryIPI, an automated framework that uses tool descriptions as optimizable payloads and refines them via iterative, prompt-based blackbox optimization. QueryIPI leverages system invariants for initial seed generation aligned with agent conventions, and iterative reflection to resolve instruction-following failures and safety refusals. Experiments on five simulated agents show that QueryIPI achieves up to 87% success rate, outperforming the best baseline (50%). Crucially, generated malicious descriptions transfer to real-world coding agents, highlighting a practical security risk.
llm The Hong Kong University of Science and Technology · Fudan University · Tsinghua University
attack arXiv Oct 27, 2025 · Oct 2025
Zesen Liu, Zhixiang Zhang, Yuchong Xie et al. · The Hong Kong University of Science and Technology
Attacks LLM-agent prompt compression modules via adversarial edits and latent perturbations, achieving 83–87% ASR with high stealthiness
Input Manipulation Attack Prompt Injection nlp
LLM-powered agents often use prompt compression to reduce inference costs, but this introduces a new security risk. Compression modules, which are optimized for efficiency rather than safety, can be manipulated by adversarial inputs, causing semantic drift and altering LLM behavior. This work identifies prompt compression as a novel attack surface and presents CompressionAttack, the first framework to exploit it. CompressionAttack includes two strategies: HardCom, which uses discrete adversarial edits for hard compression, and SoftCom, which performs latent-space perturbations for soft compression. Experiments on multiple LLMs show up to an average ASR of 83% and 87% in two tasks, while remaining highly stealthy and transferable. Case studies in three practical scenarios confirm real-world impact, and current defenses prove ineffective, highlighting the need for stronger protections.
llm transformer The Hong Kong University of Science and Technology
defense arXiv Feb 9, 2026 · 8w ago
Liwen Wang, Zongjie Li, Yuchong Xie et al. · The Hong Kong University of Science and Technology · HSBC
Watermarks agentic LLM systems by biasing tool execution paths, so stolen imitation models inherit detectable signatures
Model Theft Model Theft nlp
The evolution of Large Language Models (LLMs) into agentic systems that perform autonomous reasoning and tool use has created significant intellectual property (IP) value. We demonstrate that these systems are highly vulnerable to imitation attacks, where adversaries steal proprietary capabilities by training imitation models on victim outputs. Crucially, existing LLM watermarking techniques fail in this domain because real-world agentic systems often operate as grey boxes, concealing the internal reasoning traces required for verification. This paper presents AGENTWM, the first watermarking framework designed specifically for agentic models. AGENTWM exploits the semantic equivalence of action sequences, injecting watermarks by subtly biasing the distribution of functionally identical tool execution paths. This mechanism allows AGENTWM to embed verifiable signals directly into the visible action trajectory while remaining indistinguishable to users. We develop an automated pipeline to generate robust watermark schemes and a rigorous statistical hypothesis testing procedure for verification. Extensive evaluations across three complex domains demonstrate that AGENTWM achieves high detection accuracy with negligible impact on agent performance. Our results confirm that AGENTWM effectively protects agentic IP against adaptive adversaries, who cannot remove the watermarks without severely degrading the stolen model's utility.
llm transformer The Hong Kong University of Science and Technology · HSBC
attack arXiv Jan 30, 2026 · 9w ago
Zhixiang Zhang, Zesen Liu, Yuchong Xie et al. · The Hong Kong University of Science and Technology · Fudan University
CacheAttack exploits semantic cache collision vulnerabilities to hijack LLM responses at 86% success rate across major providers
Output Integrity Attack Prompt Injection nlp
Semantic caching has emerged as a pivotal technique for scaling LLM applications, widely adopted by major providers including AWS and Microsoft. By utilizing semantic embedding vectors as cache keys, this mechanism effectively minimizes latency and redundant computation for semantically similar queries. In this work, we conceptualize semantic cache keys as a form of fuzzy hashes. We demonstrate that the locality required to maximize cache hit rates fundamentally conflicts with the cryptographic avalanche effect necessary for collision resistance. Our conceptual analysis formalizes this inherent trade-off between performance (locality) and security (collision resilience), revealing that semantic caching is naturally vulnerable to key collision attacks. While prior research has focused on side-channel and privacy risks, we present the first systematic study of integrity risks arising from cache collisions. We introduce CacheAttack, an automated framework for launching black-box collision attacks. We evaluate CacheAttack in security-critical tasks and agentic workflows. It achieves a hit rate of 86\% in LLM response hijacking and can induce malicious behaviors in LLM agent, while preserving strong transferability across different embedding models. A case study on a financial agent further illustrates the real-world impact of these vulnerabilities. Finally, we discuss mitigation strategies.
llm transformer The Hong Kong University of Science and Technology · Fudan University