defense 2025

Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis

Mintong Kang ^1,2, Chong Xiang ¹, Sanjay Kariyappa ¹, Chaowei Xiao ^1,3, Bo Li ², Edward Suh ¹

¹ NVIDIA

² University of Illinois Urbana-Champaign

³ Johns Hopkins University

1 citations · 28 references · arXiv

Published on arXiv

2512.00966

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Reduces adaptive indirect prompt injection attack success rate from 100% to 8.5% on Mind2Web with no utility degradation in all but one evaluated setting.

IntentGuard

Novel technique introduced

Indirect prompt injection attacks (IPIAs), where large language models (LLMs) follow malicious instructions hidden in input data, pose a critical threat to LLM-powered agents. In this paper, we present IntentGuard, a general defense framework based on instruction-following intent analysis. The key insight of IntentGuard is that the decisive factor in IPIAs is not the presence of malicious text, but whether the LLM intends to follow instructions from untrusted data. Building on this insight, IntentGuard leverages an instruction-following intent analyzer (IIA) to identify which parts of the input prompt the model recognizes as actionable instructions, and then flag or neutralize any overlaps with untrusted data segments. To instantiate the framework, we develop an IIA that uses three "thinking intervention" strategies to elicit a structured list of intended instructions from reasoning-enabled LLMs. These techniques include start-of-thinking prefilling, end-of-thinking refinement, and adversarial in-context demonstration. We evaluate IntentGuard on two agentic benchmarks (AgentDojo and Mind2Web) using two reasoning-enabled LLMs (Qwen-3-32B and gpt-oss-20B). Results demonstrate that IntentGuard achieves (1) no utility degradation in all but one setting and (2) strong robustness against adaptive prompt injection attacks (e.g., reducing attack success rates from 100% to 8.5% in a Mind2Web scenario).

Key Contributions

IntentGuard framework that detects indirect prompt injection by analyzing whether an LLM intends to follow instructions sourced from untrusted data segments
Three 'thinking intervention' strategies for reasoning-enabled LLMs: start-of-thinking prefilling, end-of-thinking refinement, and adversarial in-context demonstration
Evaluation on AgentDojo and Mind2Web benchmarks showing near-zero utility degradation while reducing adaptive attack success rates from 100% to 8.5%

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llmtransformer

Threat Tags

inference_timeblack_box

Datasets

AgentDojoMind2Web

Applications

llm-powered agentsagentic web navigationtool-using llm systems

Read PDF arXiv DOI

Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

PISanitizer: Preventing Prompt Injection to Long-Context LLMs via Prompt Sanitization

Knowing When Not to Answer: Lightweight KB-Aligned OOD Detection for Safe RAG

Adversarial Distilled Retrieval-Augmented Guarding Model for Online Malicious Intent Detection

Securing AI Agents Against Prompt Injection Attacks

Prefix Probing: Lightweight Harmful Content Detection for Large Language Models

ConceptGuard: Neuro-Symbolic Safety Guardrails via Sparse Interpretable Jailbreak Concepts

PIShield: Detecting Prompt Injection Attacks via Intrinsic LLM Features

SecInfer: Preventing Prompt Injection via Inference-time Scaling