defense 2026

Peak + Accumulation: A Proxy-Level Scoring Formula for Multi-Turn LLM Attack Detection

J Alex Corll

0 citations · 15 references · arXiv (Cornell University)

Published on arXiv

2602.11247

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves 90.8% recall at 1.20% false positive rate (F1=85.9%) on multi-turn jailbreak detection without invoking an LLM, with a phase transition at persistence parameter ρ≈0.4 yielding a 12-point recall jump.
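The reported recall, false positive rate, and F1 can be cross-checked against the dataset sizes given in this summary (588 attack and 10,066 benign conversations). The sketch below is only an arithmetic sanity check: the confusion-matrix counts are an assumption recovered by rounding the reported rates, not figures taken from the paper.

```python
# Figures reported in this summary.
N_ATTACK, N_BENIGN = 588, 10_066
RECALL, FPR = 0.908, 0.0120

# Assumption: integer counts are recovered by rounding; the summary gives only rates.
tp = round(RECALL * N_ATTACK)   # attacks correctly flagged
fp = round(FPR * N_BENIGN)      # benign conversations wrongly flagged
fn = N_ATTACK - tp              # attacks missed

precision = tp / (tp + fp)
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"TP={tp} FP={fp} FN={fn} precision={precision:.3f} F1={f1:.3f}")
# prints: TP=534 FP=121 FN=54 precision=0.815 F1=0.859
```

Under that rounding assumption the implied precision is about 81.5%, and the resulting F1 matches the reported 85.9%.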

Peak + Accumulation Scoring

Novel technique introduced


Multi-turn prompt injection attacks distribute malicious intent across multiple conversation turns, exploiting the assumption that each turn is evaluated independently. While single-turn detection has been extensively studied, no published formula exists for aggregating per-turn pattern scores into a conversation-level risk score at the proxy layer, without invoking an LLM. We identify a fundamental flaw in the intuitive weighted-average approach: it converges to the per-turn score regardless of turn count, meaning a 20-turn persistent attack scores identically to a single suspicious turn. Drawing on analogies from change-point detection (CUSUM), Bayesian belief updating, and security risk-based alerting, we propose peak + accumulation scoring, a formula combining peak single-turn risk, persistence ratio, and category diversity. Evaluated on 10,654 multi-turn conversations (588 attacks sourced from WildJailbreak adversarial prompts and 10,066 benign conversations from WildChat), the formula achieves 90.8% recall at 1.20% false positive rate with an F1 of 85.9%. A sensitivity analysis over the persistence parameter reveals a phase transition at ρ ≈ 0.4, where recall jumps 12 percentage points with negligible FPR increase. We release the scoring algorithm, pattern library, and evaluation harness as open source.
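The weighted-average flaw described above can be illustrated in a few lines. This is a minimal sketch, not code from the released harness: the function name `weighted_average`, the per-turn score of 0.6, and the recency weights are all hypothetical values chosen for illustration.

```python
def weighted_average(scores, weights=None):
    """Weighted mean of per-turn scores; uniform weights if none are given."""
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

PER_TURN = 0.6  # hypothetical per-turn pattern score

# One suspicious turn versus twenty consecutive suspicious turns.
single = weighted_average([PER_TURN])
persistent = weighted_average([PER_TURN] * 20)

# Recency weighting does not help: any normalized weighting of
# identical scores still returns that same score.
recency_weights = [0.9 ** (19 - i) for i in range(20)]
recency = weighted_average([PER_TURN] * 20, recency_weights)

print(single, persistent, recency)  # all three agree up to float rounding
```

Because the weights are normalized, the aggregate can never exceed the largest per-turn score, so repetition across turns adds no evidence under this scheme.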


Key Contributions

  • Proves the 'weighted-average ceiling' — a fundamental flaw where averaging per-turn scores converges to the per-turn score regardless of attack persistence across turns.
  • Proposes 'peak + accumulation scoring', a deterministic proxy-computable formula combining peak single-turn risk, persistence ratio, and category diversity to detect multi-turn attacks.
  • Evaluates on 10,654 multi-turn conversations (588 attacks from WildJailbreak, 10,066 benign from WildChat), achieving 90.8% recall at 1.20% FPR (F1=85.9%), and releases the scoring algorithm, pattern library, and evaluation harness as open source.
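This summary names the three ingredients of peak + accumulation scoring (peak single-turn risk, persistence ratio, category diversity) but does not give the formula itself. The sketch below is therefore one plausible way to combine them, not the author's method: the additive form, the clipping to 1.0, and every name and default (`Turn`, `conversation_risk`, `flag_threshold`, `persistence_weight`, `diversity_weight`, `max_categories`) are assumptions made for illustration. In particular, `persistence_weight` is not claimed to be the paper's ρ.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Turn:
    score: float                         # per-turn pattern score in [0, 1]
    categories: frozenset = frozenset()  # pattern categories matched on this turn

def conversation_risk(turns, flag_threshold=0.3, persistence_weight=0.4,
                      diversity_weight=0.1, max_categories=4):
    """Hypothetical additive combination; all defaults are illustrative."""
    if not turns:
        return 0.0
    # Peak: the single most suspicious turn.
    peak = max(t.score for t in turns)
    # Persistence ratio: fraction of turns at or above the flag threshold.
    flagged = [t for t in turns if t.score >= flag_threshold]
    persistence = len(flagged) / len(turns)
    # Category diversity: distinct categories across flagged turns, capped.
    distinct = set().union(*(t.categories for t in flagged))
    diversity = min(len(distinct), max_categories) / max_categories
    return min(1.0, peak
               + persistence_weight * persistence
               + diversity_weight * diversity)

benign = Turn(0.0)
# Twenty turns with one suspicious turn versus twenty suspicious turns.
isolated = [Turn(0.5, frozenset({"role_play"}))] + [benign] * 19
sustained = [Turn(0.5, frozenset({"role_play" if i % 2 else "override"}))
             for i in range(20)]

isolated_risk = conversation_risk(isolated)
sustained_risk = conversation_risk(sustained)
print(isolated_risk, sustained_risk)
```

Both conversations share the same peak, yet the sustained one scores higher because persistence and diversity add to the peak instead of being averaged into it. A deterministic function like this needs only per-turn pattern scores, which is what makes it computable at a proxy without an LLM call.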

Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time, black_box
Datasets
WildJailbreak, WildChat
Applications
llm api proxies, conversational ai safety guardrails