ML Security Papers

benchmark · arXiv · Jan 25, 2026

Prompt Injection Evaluations: Refusal Boundary Instability and Artifact-Dependent Compliance in GPT-4-Series Models

Thomas Heverin · The Baldwin School

Benchmarks the stability of prompt injection refusals in GPT-4-series models, showing that one-third of refusals can be bypassed via structured perturbations, with artifact type as the key predictor of compliance.

Prompt Injection nlp
