
The System Prompt Is the Attack Surface: How LLM Agent Configuration Shapes Security and Creates Exploitable Vulnerabilities

Ron Litvak


Published on arXiv (2603.25056)

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Domain-matching prompt strategies achieve 93.7% recall at 3.8% FPR on the benchmark, but lose up to half their recall under infrastructure phishing, where attackers register matching domains

PhishNChips

Novel technique introduced


System prompt configuration can make the difference between near-total phishing blindness and near-perfect detection in LLM email agents. We present PhishNChips, a study of 11 models under 10 prompt strategies, showing that prompt-model interaction is a first-order security variable: a single model's phishing bypass rate ranges from under 1% to 97% depending on how it is configured, while the false-positive cost of the same prompt varies sharply across models. We then show that optimizing prompts around highly predictive signals can improve benchmark performance, reaching up to 93.7% recall at 3.8% false positive rate, but also creates a brittle attack surface. In particular, domain-matching strategies perform well when legitimate emails mostly have matched sender and URL domains, yet degrade sharply when attackers invert that signal by registering matching infrastructure. Response-trace analysis shows that 98% of successful bypasses reason in ways consistent with the inverted signal: the models are following the instruction, but the instruction's core assumption has become false. A counter-intuitive corollary follows: making prompts more specific can degrade already-capable models by replacing broader multi-signal reasoning with exploitable single-signal dependence. We characterize the resulting tension between detection, usability, and adversarial robustness as a navigable tradeoff, introduce Safetility, a deployability-aware metric that penalizes false positives, and argue that closing the adversarial gap likely requires tool augmentation with external ground truth.
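The abstract's headline numbers (93.7% recall at 3.8% false positive rate) are standard binary-classification metrics. A minimal reference implementation makes the two quantities precise (the function name and example labels are illustrative, not from the paper):

```python
def detection_metrics(y_true, y_pred):
    """Recall and false-positive rate for a binary phishing classifier.

    y_true / y_pred: sequences of labels, 1 = phishing, 0 = legitimate.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0  # fraction of phish caught
    fpr = fp / (fp + tn) if fp + tn else 0.0     # fraction of legit mail flagged
    return recall, fpr

# Toy run: 4 phishing and 4 legitimate emails.
recall, fpr = detection_metrics([1, 1, 1, 1, 0, 0, 0, 0],
                                [1, 1, 1, 0, 1, 0, 0, 0])
print(recall, fpr)  # 0.75 0.25
```

The paper's Safetility metric goes further by explicitly penalizing false positives, reflecting that a detector which quarantines legitimate mail is undeployable regardless of its recall.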


Key Contributions

  • Demonstrates that system prompt configuration is a first-order security variable: a single model's phishing bypass rate ranges from under 1% to 97% across prompt strategies
  • Introduces infrastructure phishing attack showing optimized domain-matching prompts lose up to 50% recall when attackers register matching domains
  • Shows 98% of successful bypasses explicitly cite the inverted signal (domain consistency) as evidence of legitimacy: the models faithfully execute flawed instructions

🛡️ Threat Analysis

Input Manipulation Attack

The infrastructure phishing attack demonstrates adversarial input manipulation at inference time: by registering domains that match their sender addresses, attackers craft emails that satisfy the model's domain-matching heuristic and thereby bypass phishing detection.
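In the paper the heuristic lives inside a natural-language system prompt, but the exploited signal can be sketched as explicit code. The sketch below (hostnames, addresses, and function names are illustrative assumptions, not taken from the paper) shows why the single signal inverts once the attacker controls matching infrastructure:

```python
from urllib.parse import urlparse

def registered_domain(host: str) -> str:
    # Naive eTLD+1 approximation: keep the last two labels.
    # A production check would consult the Public Suffix List.
    parts = host.lower().rstrip(".").split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def domain_match_verdict(sender_addr: str, link_url: str) -> str:
    """Single-signal heuristic: flag the email only when the sender's
    domain and the linked URL's domain disagree."""
    sender_dom = registered_domain(sender_addr.split("@")[-1])
    url_dom = registered_domain(urlparse(link_url).hostname or "")
    return "phishing" if sender_dom != url_dom else "legitimate"

# Conventional phishing: sender and link infrastructure disagree -> caught.
print(domain_match_verdict("support@paypal.com",
                           "http://paypa1-login.net/verify"))   # phishing

# Infrastructure phishing: the attacker registers a lookalike domain and
# sends *from* it, so sender and URL domains agree -> heuristic passes it.
print(domain_match_verdict("billing@paypal-billing-help.com",
                           "https://paypal-billing-help.com/verify"))  # legitimate
```

This is the brittleness the paper measures: the model (here, the function) executes its instruction faithfully, but the instruction's core assumption, that matched domains imply legitimacy, has been inverted by the attacker.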


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time, targeted
Datasets
PhishNChips benchmark (220,000 evaluations)
Applications
email security, phishing detection, llm agents