When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift
Published on arXiv
2602.14161
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Standard cross-validation inflates AUC by 8.4 points over LODO, and all tested production guardrails achieve only 7–37% detection on indirect prompt injection attacks targeting agents.
Leave-One-Dataset-Out (LODO)
Novel technique introduced
Detecting prompt injection and jailbreak attacks is critical for deploying LLM-based agents safely. As agents increasingly process untrusted data from emails, documents, tool outputs, and external APIs, robust attack detection becomes essential. Yet current evaluation practices and production systems have fundamental limitations. We present a comprehensive analysis using a diverse benchmark of 18 datasets spanning harmful requests, jailbreaks, indirect prompt injections, and extraction attacks. We propose Leave-One-Dataset-Out (LODO) evaluation to measure true out-of-distribution generalization, revealing that the standard practice of train-test splits from the same dataset sources severely overestimates performance: aggregate metrics show an 8.4 percentage point AUC inflation, but per-dataset gaps range from 1% to 25% accuracy, exposing heterogeneous failure modes. To understand why classifiers fail to generalize, we analyze Sparse Auto-Encoder (SAE) feature coefficients across LODO folds, finding that 28% of top features are dataset-dependent shortcuts whose class signal depends on specific dataset compositions rather than semantic content. We systematically compare production guardrails (PromptGuard 2, LlamaGuard) and LLM-as-judge approaches on our benchmark, finding all three fail on indirect attacks targeting agents (7–37% detection) and that PromptGuard 2 and LlamaGuard cannot evaluate agentic tool injection due to architectural limitations. Finally, we show that LODO-stable SAE features provide more reliable explanations for classifier decisions by filtering dataset artifacts. We release our evaluation framework at https://github.com/maxf-zn/prompt-mining to establish LODO as the appropriate protocol for prompt attack detection research.
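The LODO protocol described above can be sketched with scikit-learn's `LeaveOneGroupOut` splitter, treating each source dataset as a group so every fold trains with one dataset held out entirely. The toy data, feature construction, and classifier below are illustrative stand-ins, not the paper's actual benchmark or model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)

# Toy corpus: three "datasets", each with its own feature shift
# (a stand-in for dataset-provenance artifacts).
X, y, groups = [], [], []
for g, shift in enumerate([0.0, 0.5, 1.0]):
    n = 200
    labels = rng.integers(0, 2, n)
    feats = rng.normal(0.0, 1.0, (n, 16))
    feats[:, 0] += labels * 1.5   # genuine "attack" signal, shared across datasets
    feats[:, 1] += shift          # provenance artifact, differs per dataset
    X.append(feats)
    y.append(labels)
    groups.append(np.full(n, g))
X, y, groups = np.vstack(X), np.concatenate(y), np.concatenate(groups)

# LODO: each fold holds out one entire dataset, so the classifier
# never sees the held-out dataset's artifacts during training.
aucs = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], scores))

print(f"LODO mean AUC: {np.mean(aucs):.3f}")
```

A conventional in-distribution split would instead shuffle all three datasets together before splitting, letting the classifier exploit the per-dataset artifact and inflating the measured AUC.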
Key Contributions
- Leave-One-Dataset-Out (LODO) evaluation protocol that holds out entire datasets during training, revealing standard train-test splits overestimate AUC by 8.4 percentage points (0.996 vs 0.912)
- Dataset shortcut analysis via Sparse Auto-Encoder features, finding 28% of top classifier features exploit dataset-provenance signals rather than semantic attack content
- Systematic comparison of production guardrails (PromptGuard 2, LlamaGuard) and LLM-as-judge approaches, finding all fail on indirect injection (7–37% detection) and cannot handle agentic tool injection
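The LODO-stability filter from the third contribution can be sketched as a simple consistency check: a feature whose classifier coefficient flips sign (or collapses toward zero) depending on which dataset is held out is likely a dataset-provenance shortcut rather than a semantic attack signal. The coefficient matrix and magnitude threshold below are illustrative, not the paper's actual SAE coefficients.

```python
import numpy as np

# Rows = LODO folds, columns = SAE features (toy values, not the paper's).
coefs = np.array([
    [ 1.2,  0.9, -0.8,  0.05],
    [ 1.1,  0.8,  0.7, -0.60],
    [ 1.3,  1.0, -0.9,  0.55],
])

def lodo_stable(coefs, min_magnitude=0.1):
    """A feature counts as LODO-stable if every fold agrees on its
    coefficient sign and its magnitude stays above a floor in all folds."""
    same_sign = np.all(np.sign(coefs) == np.sign(coefs[0]), axis=0)
    big_enough = np.all(np.abs(coefs) >= min_magnitude, axis=0)
    return same_sign & big_enough

stable = lodo_stable(coefs)
print(stable)  # features 2 and 3 flip sign across folds: dataset shortcuts
```

Keeping only the stable columns when explaining classifier decisions filters out features whose class signal exists only under a particular dataset composition.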