benchmark 2026

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Yen-Shan Chen 1,2, Sian-Yao Huang 1, Cheng-Lin Yang 1, Yun-Nung Chen 2

Published on arXiv: 2604.07223

Prompt Injection

OWASP LLM Top 10 — LLM01

Insecure Plugin Design

OWASP LLM Top 10 — LLM07

Excessive Agency

OWASP LLM Top 10 — LLM08

Benchmarks & Evaluation

LLMs for Security — LS10

Blue-Team Agents

LLMs for Security — LS07

Key Finding

Guardrail performance correlates strongly with structured reasoning (ρ=0.79) but shows near-zero correlation with standard jailbreak robustness, revealing structural competence as the primary bottleneck

TraceSafe-Bench

Novel technique introduced


As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces. While safety guardrails are well-benchmarked for natural language responses, their efficacy remains largely unexplored within multi-step tool-use trajectories. To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety. It encompasses 12 risk categories, ranging from security threats (e.g., prompt injection, privacy leaks) to operational failures (e.g., hallucinations, interface inconsistencies), featuring over 1,000 unique execution instances. Our evaluation of 13 LLM-as-a-guard models and 7 specialized guardrails yields three critical findings: 1) Structural Bottleneck: Guardrail efficacy is driven more by structural data competence (e.g., JSON parsing) than semantic safety alignment. Performance correlates strongly with structured-to-text benchmarks (ρ=0.79) but shows near-zero correlation with standard jailbreak robustness. 2) Architecture over Scale: Model architecture influences risk detection performance more significantly than model size, with general-purpose LLMs consistently outperforming specialized safety guardrails in trajectory analysis. 3) Temporal Stability: Accuracy remains resilient across extended trajectories. Increased execution steps allow models to pivot from static tool definitions to dynamic execution behaviors, actually improving risk detection performance in later stages. Our findings suggest that securing agentic workflows requires jointly optimizing for structural reasoning and safety alignment to effectively mitigate mid-trajectory risks.
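To make "mid-trajectory" concrete, here is a minimal sketch of a guard that inspects each prefix of a tool-calling trace and not only the final output. The trace format, the tool names, and the `toy_guard` rule are all hypothetical illustrations (this summary does not give the benchmark's actual schema), and the rule covers only one simple kind of interface inconsistency.

```python
import json

# Hypothetical tool definitions: tool name -> set of declared argument names.
TOOLS = {
    "search_flights": {"origin", "destination"},
    "send_email": {"to", "body"},
}

def toy_guard(tool_defs, prefix):
    """Return True if the latest step in the prefix looks unsafe.

    Toy rule (an assumption, not the paper's method): flag a call to an
    undeclared tool, or a call that passes undeclared arguments.
    """
    step = json.loads(prefix[-1])  # each step is a JSON-encoded tool call
    allowed = tool_defs.get(step["tool"])
    if allowed is None:
        return True
    return not set(step["args"]) <= allowed

def first_flagged_step(tool_defs, trace, guard=toy_guard):
    """Run the guard on every prefix; return the 1-based index of the first flag."""
    for i in range(1, len(trace) + 1):
        if guard(tool_defs, trace[:i]):
            return i
    return None

# Hypothetical two-step trace; the second call passes an undeclared "bcc" argument.
trace = [
    json.dumps({"tool": "search_flights",
                "args": {"origin": "TPE", "destination": "NRT"}}),
    json.dumps({"tool": "send_email",
                "args": {"to": "a@example.com", "body": "hi",
                         "bcc": "x@example.com"}}),
]
print(first_flagged_step(TOOLS, trace))
```

Even this toy guard has to parse JSON before it can apply any safety rule, which is in the spirit of the structural bottleneck the abstract describes.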


Key Contributions

  • First comprehensive benchmark (TraceSafe-Bench) for assessing mid-trajectory safety in LLM agents with 12 risk categories and 1,000+ execution instances
  • Demonstrates that guardrail efficacy depends more on structural data competence (e.g., JSON parsing) than on semantic safety alignment: performance correlates strongly with structured-to-text benchmarks (ρ=0.79) but shows near-zero correlation with standard jailbreak robustness
  • Shows model architecture matters more than scale for trajectory risk detection, with general-purpose LLMs outperforming specialized safety guardrails
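The ρ=0.79 figure above is a correlation between per-model scores on two evaluations. The sketch below shows how such a coefficient could be computed, assuming ρ denotes Spearman's rank correlation (the summary does not state which coefficient was used). The score lists are invented placeholders, not results from the paper.

```python
from statistics import mean

def ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Placeholder per-model scores (invented for illustration only).
structured_benchmark = [0.2, 0.5, 0.4, 0.9, 0.7]
guardrail_accuracy = [0.3, 0.4, 0.6, 0.8, 0.7]
print(spearman(structured_benchmark, guardrail_accuracy))
```

A rank-based coefficient is a natural fit for comparing models across benchmarks with different score scales, since it depends only on how the models are ordered.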

Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time
Datasets
TraceSafe-Bench
Applications
llm agents, multi-step reasoning, tool-calling systems, autonomous agents