Securing AI Agents Against Prompt Injection Attacks
Badrinath Ramakrishnan, Akshaya Balaji
Published on arXiv (arXiv:2511.15759)
Prompt Injection
OWASP LLM Top 10 (LLM01)
Key Finding
Combined three-layer defense reduces prompt injection success rate from 73.2% to 8.7% across seven LLMs while maintaining 94.3% of baseline task performance.
Retrieval-augmented generation (RAG) systems are widely used to extend large language model capabilities, but they introduce significant security vulnerabilities through prompt injection attacks. We present a comprehensive benchmark for evaluating prompt injection risks in RAG-enabled AI agents and propose a multi-layered defense framework. The benchmark includes 847 adversarial test cases across five attack categories: direct injection, context manipulation, instruction override, data exfiltration, and cross-context contamination. Across seven state-of-the-art language models, we evaluate three defense mechanisms: content filtering with embedding-based anomaly detection, hierarchical system prompt guardrails, and multi-stage response verification. The combined framework reduces successful attack rates from 73.2% to 8.7% while maintaining 94.3% of baseline task performance. We release our benchmark dataset and defense implementation to support future research in AI agent security.
Key Contributions
- 847-case benchmark spanning five RAG prompt injection categories (direct injection, context manipulation, instruction override, data exfiltration, cross-context contamination) with 500 benign controls for false-positive measurement
- Multi-layered defense framework combining embedding-based anomaly content filtering, hierarchical system prompt guardrails, and multi-stage response verification
- Cross-model evaluation across seven state-of-the-art LLMs showing attack success reduced from 73.2% to 8.7% while preserving 94.3% of legitimate task performance