OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence
Published on arXiv
2601.21083
Prompt Injection
OWASP LLM Top 10 — LLM01
Excessive Agency
OWASP LLM Top 10 — LLM08
Key Finding
All four frontier models over-trigger containment, with false positive rates of 45–82.5%. GPT-5.2 acts at step 4 with an 82.5% FP rate, while Claude Sonnet 4.5 shows partial calibration (45% FP, TTFC 10.6). All models correctly identify the ground-truth threat; the gap is in restraint, not detection.
OpenSec
Novel technique introduced
As large language models (LLMs) improve, so do their offensive applications: frontier agents now generate working exploits for under $50 in compute (Heelan, 2026). Defensive incident response (IR) agents must keep pace, but existing benchmarks conflate action execution with correct execution, hiding calibration failures when agents process adversarial evidence. We introduce OpenSec, a dual-control reinforcement learning (RL) environment that evaluates IR agents under realistic prompt injection scenarios with execution-based scoring: time-to-first-containment (TTFC), evidence-gated action rate (EGAR), blast radius, and per-tier injection violation rates. Evaluating four frontier models on 40 standard-tier episodes each, we find consistent over-triggering: GPT-5.2 executes containment in 100% of episodes with 82.5% false positive rate, acting at step 4 before gathering sufficient evidence. Claude Sonnet 4.5 shows partial calibration (62.5% containment, 45% FP, TTFC of 10.6), suggesting calibration is not reliably present across frontier models. All models correctly identify the ground-truth threat when they act; the calibration gap is not in detection but in restraint. Code available at https://github.com/jbarnes850/opensec-env.
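The scoring metrics named in the abstract can be illustrated with a minimal sketch. The episode-record fields below (`contained_step`, `acted_with_evidence`, `ground_truth_malicious`, `hosts_touched`) are hypothetical stand-ins, not the benchmark's actual schema; they exist only to show how TTFC, EGAR, FP rate, and blast radius could be computed from execution logs.

```python
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class Episode:
    # Hypothetical per-episode record; field names are illustrative only.
    contained_step: Optional[int]   # step of first containment action, None if agent never contained
    acted_with_evidence: bool       # containment was preceded by sufficient evidence gathering
    ground_truth_malicious: bool    # whether the episode held a real threat
    hosts_touched: int = 1          # hosts affected by containment actions

def score(episodes: List[Episode]) -> dict:
    contained = [e for e in episodes if e.contained_step is not None]
    false_pos = [e for e in contained if not e.ground_truth_malicious]
    n_c = len(contained)
    return {
        # TTFC: mean step of first containment, over episodes that contained
        "ttfc": sum(e.contained_step for e in contained) / n_c if n_c else None,
        # EGAR: fraction of containment actions that were evidence-gated
        "egar": sum(e.acted_with_evidence for e in contained) / n_c if n_c else None,
        # FP rate: containment fired on a benign episode
        "fp_rate": len(false_pos) / len(episodes),
        # Blast radius: mean hosts affected per containing episode
        "blast_radius": sum(e.hosts_touched for e in contained) / n_c if n_c else None,
        "containment_rate": n_c / len(episodes),
    }
```

Under this reading, a model that contains in every episode but often on benign evidence (high containment rate, high FP rate, low TTFC) exhibits exactly the restraint failure the paper reports.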
Key Contributions
- OpenSec dual-control RL environment with deterministic, execution-based scoring (TTFC, EGAR, blast radius, injection violation rates) for evaluating IR agent calibration under adversarial evidence
- Taxonomy-stratified scenarios with trust tiers and prompt injection payloads enabling curriculum learning for agent training
- Empirical evaluation of four frontier models revealing consistent over-triggering (45–82.5% false positive rates), with the calibration gap identified as restraint failure rather than detection failure