OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence
Published on arXiv
2601.21083
Prompt Injection
OWASP LLM Top 10 — LLM01
Excessive Agency
OWASP LLM Top 10 — LLM08
Key Finding
All four frontier models over-trigger containment, with false positive rates of 45–82.5%. GPT-5.2 acts at step 4 with an 82.5% FP rate, while Claude Sonnet 4.5 shows partial calibration (45% FP, TTFC 10.6). All models correctly identify the ground-truth threat; the gap is in restraint, not detection.
OpenSec
Novel technique introduced
As large language models (LLMs) improve, so do their offensive applications: frontier agents now generate working exploits for under $50 in compute (Heelan, 2026). Defensive incident response (IR) agents must keep pace, but existing benchmarks conflate action execution with correct execution, hiding calibration failures when agents process adversarial evidence. We introduce OpenSec, a dual-control reinforcement learning (RL) environment that evaluates IR agents under realistic prompt injection scenarios with execution-based scoring: time-to-first-containment (TTFC), evidence-gated action rate (EGAR), blast radius, and per-tier injection violation rates. Evaluating four frontier models on 40 standard-tier episodes each, we find consistent over-triggering: GPT-5.2 executes containment in 100% of episodes with 82.5% false positive rate, acting at step 4 before gathering sufficient evidence. Claude Sonnet 4.5 shows partial calibration (62.5% containment, 45% FP, TTFC of 10.6), suggesting calibration is not reliably present across frontier models. All models correctly identify the ground-truth threat when they act; the calibration gap is not in detection but in restraint. Code available at https://github.com/jbarnes850/opensec-env.
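The scoring metrics named in the abstract can be illustrated with a minimal sketch. The episode-record fields below (`contained_step`, `acted_with_evidence`, `ground_truth_malicious`, `hosts_touched`) are hypothetical stand-ins, not the benchmark's actual schema; they exist only to show how TTFC, EGAR, FP rate, and blast radius could be computed from execution logs.

```python
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class Episode:
    # Hypothetical per-episode record; field names are illustrative only.
    contained_step: Optional[int]   # step of first containment action, None if agent never contained
    acted_with_evidence: bool       # containment was preceded by sufficient evidence gathering
    ground_truth_malicious: bool    # whether the episode held a real threat
    hosts_touched: int = 1          # hosts affected by containment actions

def score(episodes: List[Episode]) -> dict:
    contained = [e for e in episodes if e.contained_step is not None]
    false_pos = [e for e in contained if not e.ground_truth_malicious]
    n_c = len(contained)
    return {
        # TTFC: mean step of first containment, over episodes that contained
        "ttfc": sum(e.contained_step for e in contained) / n_c if n_c else None,
        # EGAR: fraction of containment actions that were evidence-gated
        "egar": sum(e.acted_with_evidence for e in contained) / n_c if n_c else None,
        # FP rate: containment fired on a benign episode
        "fp_rate": len(false_pos) / len(episodes),
        # Blast radius: mean hosts affected per containing episode
        "blast_radius": sum(e.hosts_touched for e in contained) / n_c if n_c else None,
        "containment_rate": n_c / len(episodes),
    }
```

Under this reading, a model that contains in every episode but often on benign evidence (high containment rate, high FP rate, low TTFC) exhibits exactly the restraint failure the paper reports.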
Key Contributions
- OpenSec dual-control RL environment with deterministic, execution-based scoring (TTFC, EGAR, blast radius, injection violation rates) for evaluating IR agent calibration under adversarial evidence
- Taxonomy-stratified scenarios with trust tiers and prompt injection payloads enabling curriculum learning for agent training
- Empirical evaluation of four frontier models revealing consistent over-triggering (45–82.5% false positive rates), with the calibration gap identified as restraint failure rather than detection failure