benchmark 2026

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Yen-Shan Chen ¹,², Sian-Yao Huang ¹, Cheng-Lin Yang ¹, Yun-Nung Chen ²

¹ CyCraft

² National Taiwan University

0 citations

Published on arXiv: 2604.07223

Prompt Injection

OWASP LLM Top 10 — LLM01

Insecure Plugin Design

OWASP LLM Top 10 — LLM07

Excessive Agency

OWASP LLM Top 10 — LLM08

Benchmarks & Evaluation

LLMs for Security — LS10

Blue-Team Agents

LLMs for Security — LS07

Key Finding

Guardrail performance correlates strongly with performance on structured-to-text benchmarks (ρ=0.79) but shows near-zero correlation with standard jailbreak robustness, which points to structural competence as the primary bottleneck.

TraceSafe-Bench

Novel technique introduced

As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces. While safety guardrails are well-benchmarked for natural language responses, their efficacy remains largely unexplored within multi-step tool-use trajectories. To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety. It encompasses 12 risk categories, ranging from security threats (e.g., prompt injection, privacy leaks) to operational failures (e.g., hallucinations, interface inconsistencies), featuring over 1,000 unique execution instances. Our evaluation of 13 LLM-as-a-guard models and 7 specialized guardrails yields three critical findings: 1) Structural Bottleneck: Guardrail efficacy is driven more by structural data competence (e.g., JSON parsing) than semantic safety alignment. Performance correlates strongly with structured-to-text benchmarks (ρ=0.79) but shows near-zero correlation with standard jailbreak robustness. 2) Architecture over Scale: Model architecture influences risk detection performance more significantly than model size, with general-purpose LLMs consistently outperforming specialized safety guardrails in trajectory analysis. 3) Temporal Stability: Accuracy remains resilient across extended trajectories. Increased execution steps allow models to pivot from static tool definitions to dynamic execution behaviors, actually improving risk detection performance in later stages. Our findings suggest that securing agentic workflows requires jointly optimizing for structural reasoning and safety alignment to effectively mitigate mid-trajectory risks.
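To make "mid-trajectory" checking concrete, here is a minimal Python sketch of a purely structural check over a tool-calling trace. It is not the benchmark's code or any evaluated guardrail. The `Step` schema, the `declared_tools` mapping, and both function names are hypothetical, chosen only to illustrate the kind of interface inconsistency a guard would have to notice inside a trace rather than in the final answer.

```python
import json
from dataclasses import dataclass


@dataclass
class Step:
    """One tool call in a trajectory (hypothetical schema, not from the paper)."""
    tool: str
    arguments: str  # raw JSON string as emitted by the agent
    observation: str = ""


def check_step(step, declared_tools):
    """Return a list of structural issues for one step; empty means none found."""
    if step.tool not in declared_tools:
        return [f"undeclared tool: {step.tool}"]
    try:
        args = json.loads(step.arguments)
    except json.JSONDecodeError:
        return ["arguments are not valid JSON"]
    if not isinstance(args, dict):
        return ["arguments are not a JSON object"]
    # Flag argument names that the tool definition does not declare.
    unknown = sorted(set(args) - declared_tools[step.tool])
    return [f"unknown argument: {name}" for name in unknown]


def first_flagged_step(trajectory, declared_tools):
    """Scan steps in order; return (index, issues) for the first flagged step, else None."""
    for index, step in enumerate(trajectory):
        issues = check_step(step, declared_tools)
        if issues:
            return index, issues
    return None


# Illustrative tool definitions and trace (made up for this example).
tools = {"search": {"query"}, "send_email": {"to", "body"}}
trace = [
    Step("search", '{"query": "quarterly report"}'),
    Step("send_email", '{"to": "a@example.com", "body": "hi", "bcc": "x@example.com"}'),
]
print(first_flagged_step(trace, tools))  # (1, ['unknown argument: bcc'])
```

A check like this covers only the structural side. The abstract's conclusion is that structural reasoning and safety alignment need to be optimized together, so a real guard would also have to judge the semantics of each step.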

Key Contributions

First comprehensive benchmark (TraceSafe-Bench) for assessing mid-trajectory safety in LLM agents with 12 risk categories and 1,000+ execution instances
Demonstrates that guardrail efficacy depends more on structural data competence (JSON parsing) than semantic safety alignment (ρ=0.79 correlation with structured-to-text benchmarks)
Shows model architecture matters more than scale for trajectory risk detection, with general-purpose LLMs outperforming specialized safety guardrails
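The correlation quoted above can be read as a rank correlation between per-model scores on two benchmarks. The sketch below assumes ρ denotes Spearman's rank correlation, which this summary does not state, and uses only the Python standard library. The function names and the input scores are placeholders for illustration, not results from the paper.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder per-model scores (NOT from the paper): one entry per model.
guard_accuracy = [0.61, 0.74, 0.58, 0.80, 0.69]
structured_score = [0.55, 0.70, 0.60, 0.82, 0.66]
rho = spearman(guard_accuracy, structured_score)
print(round(rho, 2))
```

Working on ranks makes the statistic insensitive to the scale of each benchmark, which is convenient when comparing scores reported in different units.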

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

inference_time

Datasets

TraceSafe-Bench

Applications

llm agents, multi-step reasoning, tool-calling systems, autonomous agents

Similar Papers

Evaluating Privilege Usage of Agents on Real-World Tools

Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks

From Assistant to Double Agent: Formalizing and Benchmarking Attacks on OpenClaw for Personalized Local AI Agent

Agent-Fence: Mapping Security Vulnerabilities Across Deep Research Agents

Agents of Chaos

SoK: Trust-Authorization Mismatch in LLM Agent Interactions

Taming OpenClaw: Security Analysis and Mitigation of Autonomous LLM Agent Threats

SafeAgent: A Runtime Protection Architecture for Agentic Systems