TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Yen-Shan Chen 1,2, Sian-Yao Huang 1, Cheng-Lin Yang 1, Yun-Nung Chen 2
Published on arXiv
2604.07223
Prompt Injection
OWASP LLM Top 10 — LLM01
Insecure Plugin Design
OWASP LLM Top 10 — LLM07
Excessive Agency
OWASP LLM Top 10 — LLM08
Benchmarks & Evaluation
LLMs for Security — LS10
Blue-Team Agents
LLMs for Security — LS07
Key Finding
Guardrail performance correlates strongly with structured reasoning (ρ=0.79) but shows near-zero correlation with standard jailbreak robustness, revealing structural competence as the primary bottleneck
TraceSafe-Bench
Novel technique introduced
As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces. While safety guardrails are well-benchmarked for natural language responses, their efficacy remains largely unexplored within multi-step tool-use trajectories. To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety. It encompasses 12 risk categories, ranging from security threats (e.g., prompt injection, privacy leaks) to operational failures (e.g., hallucinations, interface inconsistencies), featuring over 1,000 unique execution instances. Our evaluation of 13 LLM-as-a-guard models and 7 specialized guardrails yields three critical findings: 1) Structural Bottleneck: Guardrail efficacy is driven more by structural data competence (e.g., JSON parsing) than semantic safety alignment. Performance correlates strongly with structured-to-text benchmarks (ρ=0.79) but shows near-zero correlation with standard jailbreak robustness. 2) Architecture over Scale: Model architecture influences risk detection performance more significantly than model size, with general-purpose LLMs consistently outperforming specialized safety guardrails in trajectory analysis. 3) Temporal Stability: Accuracy remains resilient across extended trajectories. Increased execution steps allow models to pivot from static tool definitions to dynamic execution behaviors, actually improving risk detection performance in later stages. Our findings suggest that securing agentic workflows requires jointly optimizing for structural reasoning and safety alignment to effectively mitigate mid-trajectory risks.
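To make the idea of a mid-trajectory check concrete, the following is a minimal sketch of how a tool-calling trace might be cut off at a given step and serialized as JSON for a guard model to inspect. The schema is an assumption made for illustration: the `ToolCall` fields, the `serialize_prefix` helper, and the sample tool and steps are all hypothetical and are not taken from TraceSafe-Bench.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ToolCall:
    """One step of a trajectory (hypothetical schema, not the benchmark's)."""
    tool: str
    arguments: dict
    observation: str


def serialize_prefix(tool_specs, steps, upto):
    """Serialize the tool definitions plus the first `upto` steps as JSON.

    A guard evaluated mid-trajectory would see only this prefix, not the
    steps that come later.
    """
    payload = {
        "tools": tool_specs,
        "trajectory": [
            {"step": i + 1, **asdict(s)} for i, s in enumerate(steps[:upto])
        ],
    }
    return json.dumps(payload, indent=2)


# Hypothetical tool definition and two-step trace, for illustration only.
tool_specs = [
    {"name": "read_file", "description": "Read a text file",
     "parameters": {"path": "string"}}
]
steps = [
    ToolCall("read_file", {"path": "notes.txt"}, "meeting at 10am"),
    ToolCall("read_file", {"path": "todo.txt"}, "buy milk"),
]

# The guard input after the first step only.
prefix = serialize_prefix(tool_specs, steps, upto=1)
print(prefix)
```

Because the guard input is nested structured data rather than free text, a model that handles JSON poorly may misread the trace before any safety judgment is made, which is one way to read the structural bottleneck described above.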
Key Contributions
- First comprehensive benchmark (TraceSafe-Bench) for assessing mid-trajectory safety in LLM agents with 12 risk categories and 1,000+ execution instances
- Demonstrates that guardrail efficacy depends more on structural data competence (JSON parsing) than semantic safety alignment (ρ=0.79 correlation with structured-to-text benchmarks)
- Shows model architecture matters more than scale for trajectory risk detection, with general-purpose LLMs outperforming specialized safety guardrails
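
The correlation figure cited above can be illustrated with a small sketch. It assumes that ρ denotes Spearman's rank correlation, which the text does not state explicitly, and it uses made-up toy scores for five hypothetical models rather than any results from the paper. The helper names `average_ranks` and `spearman_rho` are likewise illustrative.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy, hypothetical scores (NOT from the paper): one entry per model.
structured_scores = [0.42, 0.55, 0.61, 0.70, 0.83]  # structured-data benchmark
guard_accuracy = [0.38, 0.52, 0.49, 0.66, 0.71]     # trajectory guard accuracy

rho = spearman_rho(structured_scores, guard_accuracy)
print(f"rho = {rho:.2f}")
```

A rank-based coefficient compares only the ordering of models on the two measures, so it is insensitive to the benchmarks using different score scales.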