benchmark 2026

Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents

Arnold Cartagena , Ariane Teixeira

0 citations · arXiv (Cornell University)

Published on arXiv

2602.16943

Excessive Agency

OWASP LLM Top 10 — LLM08

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Even under safety-reinforced system prompts, 219 cases persist across six frontier models where text refuses but tool calls execute the forbidden action, and runtime governance contracts fail to deter these attempts.

GAP benchmark

Novel technique introduced

Large language models deployed as agents increasingly interact with external systems through tool calls--actions with real-world consequences that text outputs alone do not carry. Safety evaluations, however, overwhelmingly measure text-level refusal behavior, leaving a critical question unanswered: does alignment that suppresses harmful text also suppress harmful actions? We introduce the GAP benchmark, a systematic evaluation framework that measures divergence between text-level safety and tool-call-level safety in LLM agents. We test six frontier models across six regulated domains (pharmaceutical, financial, educational, employment, legal, and infrastructure), seven jailbreak scenarios per domain, three system prompt conditions (neutral, safety-reinforced, and tool-encouraging), and two prompt variants, producing 17,420 analysis-ready datapoints. Our central finding is that text safety does not transfer to tool-call safety. Across all six models, we observe instances where the model's text output refuses a harmful request while its tool calls simultaneously execute the forbidden action--a divergence we formalize as the GAP metric. Even under safety-reinforced system prompts, 219 such cases persist across all six models. System prompt wording exerts substantial influence on tool-call behavior: TC-safe rates span 21 percentage points for the most robust model and 57 for the most prompt-sensitive, with 16 of 18 pairwise ablation comparisons remaining significant after Bonferroni correction. Runtime governance contracts reduce information leakage in all six models but produce no detectable deterrent effect on forbidden tool-call attempts themselves. These results demonstrate that text-only safety evaluations are insufficient for assessing agent behavior and that tool-call safety requires dedicated measurement and mitigation.

Key Contributions

GAP metric formalizing the divergence where a model's text output refuses a harmful request while its tool calls simultaneously execute the forbidden action, evaluated across 17,420 datapoints on six frontier models
First systematic three-way system prompt ablation (neutral, safety-reinforced, tool-encouraging) revealing prompt wording shifts TC-safe rates by 21–57 percentage points depending on model
Finding that runtime governance contracts reduce information leakage (LEAK metric) across all six models but produce no detectable deterrent on forbidden tool-call attempt rates

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

black_boxinference_time

Datasets

GAP benchmark (17,420 datapoints across 6 domains × 7 jailbreak scenarios × 3 prompt conditions × 2 prompt variants × 6 models)

Applications

llm agentstool-calling systemsregulated-domain ai deployments

Read PDF arXiv DOI Code

Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

CIBER: A Comprehensive Benchmark for Security Evaluation of Code Interpreter Agents

AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks

Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs

Beyond Jailbreak: Unveiling Risks in LLM Applications Arising from Blurred Capability Boundaries

WebTrap Park: An Automated Platform for Systematic Security Evaluation of Web Agents

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Exposing Weak Links in Multi-Agent Systems under Adversarial Prompting

ASTRA: Agentic Steerability and Risk Assessment Framework