
OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence

Jarrod Barnes



Published on arXiv

2601.21083

Prompt Injection

OWASP LLM Top 10 — LLM01

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

All four frontier models over-trigger containment, with false positive (FP) rates of 45–82.5%. GPT-5.2 acts at step 4 with an 82.5% FP rate, while Claude Sonnet 4.5 shows partial calibration (45% FP, TTFC of 10.6). All models correctly identify the ground-truth threat; the gap is in restraint, not detection.

OpenSec

Novel technique introduced


As large language models (LLMs) improve, so do their offensive applications: frontier agents now generate working exploits for under $50 in compute (Heelan, 2026). Defensive incident response (IR) agents must keep pace, but existing benchmarks conflate action execution with correct execution, hiding calibration failures when agents process adversarial evidence. We introduce OpenSec, a dual-control reinforcement learning (RL) environment that evaluates IR agents under realistic prompt injection scenarios with execution-based scoring: time-to-first-containment (TTFC), evidence-gated action rate (EGAR), blast radius, and per-tier injection violation rates. Evaluating four frontier models on 40 standard-tier episodes each, we find consistent over-triggering: GPT-5.2 executes containment in 100% of episodes with 82.5% false positive rate, acting at step 4 before gathering sufficient evidence. Claude Sonnet 4.5 shows partial calibration (62.5% containment, 45% FP, TTFC of 10.6), suggesting calibration is not reliably present across frontier models. All models correctly identify the ground-truth threat when they act; the calibration gap is not in detection but in restraint. Code available at https://github.com/jbarnes850/opensec-env.
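The abstract's scoring metrics aggregate naturally over episode logs. The sketch below is illustrative only: the `Episode` fields and `score` function are hypothetical names, not OpenSec's actual schema, but they show how TTFC, EGAR, and FP rate could be computed over a batch of episodes in which an agent may or may not execute containment.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Episode:
    # Hypothetical episode record; field names are illustrative,
    # not OpenSec's actual schema.
    contained: bool               # did the agent execute a containment action?
    containment_step: int         # step index of first containment (if contained)
    evidence_gated: bool          # was containment preceded by sufficient evidence?
    ground_truth_malicious: bool  # was containment actually warranted?

def score(episodes):
    """Aggregate calibration metrics over a batch of episodes.

    TTFC, EGAR, and FP rate are computed only over episodes where the
    agent acted; containment rate is over all episodes.
    """
    acted = [e for e in episodes if e.contained]
    ttfc = mean(e.containment_step for e in acted) if acted else None
    egar = mean(e.evidence_gated for e in acted) if acted else None
    fp_rate = mean(not e.ground_truth_malicious for e in acted) if acted else None
    return {
        "containment_rate": len(acted) / len(episodes),
        "TTFC": ttfc,
        "EGAR": egar,
        "FP_rate": fp_rate,
    }
```

Under this framing, an over-triggering agent shows a high containment rate, low TTFC, and high FP rate simultaneously, which is the pattern the paper reports for GPT-5.2.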


Key Contributions

  • OpenSec dual-control RL environment with deterministic, execution-based scoring (TTFC, EGAR, blast radius, injection violation rates) for evaluating IR agent calibration under adversarial evidence
  • Taxonomy-stratified scenarios with trust tiers and prompt injection payloads enabling curriculum learning for agent training
  • Empirical evaluation of four frontier models revealing consistent over-triggering (45–82.5% false positive rates), with the calibration gap identified as restraint failure rather than detection failure



Details

Domains
nlp, reinforcement-learning
Model Types
llm
Threat Tags
inference_time, black_box
Datasets
OpenSec (40 standard-tier episodes per model, custom)
Applications
incident response, security operations center, llm agent evaluation