
Prompt Injection Evaluations: Refusal Boundary Instability and Artifact-Dependent Compliance in GPT-4-Series Models

Thomas Heverin

0 citations · 1 reference · arXiv


Published on arXiv

2601.17911

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Approximately one-third of initial refusal-inducing prompts showed at least one compliance transition under perturbation, with artifact type explaining more variance in refusal failure than perturbation style across both GPT-4.1 and GPT-4o.

Refusal Boundary Entropy (RBE)

Novel technique introduced


Prompt injection evaluations typically treat refusal as a stable, binary indicator of safety. This study challenges that paradigm by modeling refusal as a local decision boundary and examining its stability under structured perturbations. We evaluated two models, GPT-4.1 and GPT-4o, using 3,274 perturbation runs derived from refusal-inducing prompt injection attempts. Each base prompt was subjected to 25 perturbations across five structured families, with outcomes manually coded as Refusal, Partial Compliance, or Full Compliance. Using chi-square tests, logistic regression, mixed-effects modeling, and a novel Refusal Boundary Entropy (RBE) metric, we demonstrate that while both models refuse >94% of attempts, refusal instability is persistent and non-uniform. Approximately one-third of initial refusal-inducing prompts exhibited at least one "refusal escape," a transition to compliance under perturbation. We find that artifact type is a stronger predictor of refusal failure than perturbation style. Textual artifacts, such as ransomware notes, exhibited significantly higher instability, with flip rates exceeding 20%. Conversely, executable malware artifacts showed zero refusal escapes in both models. While GPT-4o demonstrated tighter refusal enforcement and lower RBE than GPT-4.1, it did not eliminate artifact-dependent risks. These findings suggest that single-prompt evaluations systematically overestimate safety robustness. We conclude that refusal behavior is a probabilistic, artifact-dependent boundary phenomenon rather than a stable binary property, requiring a shift in how LLM safety is measured and audited.
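The abstract's two headline quantities can be made concrete with a small sketch. Assuming the study's three manually coded outcome labels, a base prompt shows a "refusal escape" if any of its 25 perturbation runs yields partial or full compliance, and its flip rate is the fraction of runs that flipped. The function and label names here are illustrative, not taken from the paper's released code.

```python
from typing import List

# Outcome labels from the study's manual coding scheme.
REFUSAL = "refusal"
PARTIAL = "partial_compliance"
FULL = "full_compliance"

def has_refusal_escape(outcomes: List[str]) -> bool:
    """True if at least one perturbation run transitioned to compliance."""
    return any(o in (PARTIAL, FULL) for o in outcomes)

def flip_rate(outcomes: List[str]) -> float:
    """Fraction of perturbation runs that flipped away from refusal."""
    flips = sum(o in (PARTIAL, FULL) for o in outcomes)
    return flips / len(outcomes)

# Example: one base prompt with 25 perturbation runs, two of which flipped.
runs = [REFUSAL] * 23 + [PARTIAL, FULL]
print(has_refusal_escape(runs))  # True
print(flip_rate(runs))           # 0.08
```

Aggregating `has_refusal_escape` across all base prompts yields the paper's "approximately one-third" figure; aggregating `flip_rate` per artifact type yields the per-artifact instability comparison (e.g., the >20% flip rates for textual artifacts).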


Key Contributions

  • Novel Refusal Boundary Entropy (RBE) metric that quantifies instability of LLM safety refusals across structured prompt perturbations
  • Empirical demonstration that ~1/3 of refusal-inducing prompts exhibit at least one "refusal escape" under perturbation, challenging the assumption that single-prompt evaluations reliably measure safety robustness
  • Finding that artifact type (e.g., ransomware notes vs. executable malware) is a stronger predictor of refusal failure than perturbation style, with textual artifacts exceeding 20% flip rates
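The paper does not reproduce the RBE formula in this summary, but one plausible formulation, consistent with "entropy over refusal outcomes," is the Shannon entropy of the empirical distribution of coded outcomes across a prompt's perturbation runs: 0 bits for a perfectly stable boundary, rising as outcomes mix. This is an illustrative sketch; the paper's exact definition may differ.

```python
import math
from collections import Counter
from typing import List

def refusal_boundary_entropy(outcomes: List[str]) -> float:
    """Shannon entropy (bits) of the outcome distribution over one
    base prompt's perturbation runs. 0.0 = all runs agree (stable
    boundary); higher values = more unstable refusal behavior."""
    counts = Counter(outcomes)
    n = len(outcomes)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

# A stable prompt: all 25 runs refuse.
stable = ["refusal"] * 25
# An unstable prompt: 20 refusals, 3 partial, 2 full compliance.
unstable = ["refusal"] * 20 + ["partial_compliance"] * 3 + ["full_compliance"] * 2

print(refusal_boundary_entropy(stable))    # 0.0
print(round(refusal_boundary_entropy(unstable), 3))
```

Under this formulation, the finding that GPT-4o has "lower RBE than GPT-4.1" means its per-prompt outcome distributions are more concentrated, i.e., its refusal decisions are more consistent under perturbation.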

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time
Datasets
Custom: 3,274 perturbation runs across 5 structured perturbation families on GPT-4.1 and GPT-4o
Applications
llm safety evaluation, prompt injection testing, ai red-teaming