
ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

Yutao Mou 1,2, Zhangchi Xue 1, Lijun Li 2, Peiyang Liu 1, Shikun Zhang 1, Wei Ye 1, Jing Shao 2

2 citations · 39 references · arXiv


Published on arXiv · 2601.10156

Insecure Plugin Design

OWASP LLM Top 10 — LLM07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

TS-Flow reduces harmful tool invocations by 65% on average while improving benign task completion by ~10% under prompt injection attacks on ReAct-style agents.

TS-Guard / TS-Flow

Novel technique introduced


While LLM-based agents can interact with environments by invoking external tools, their expanded capabilities also amplify security risks. Monitoring step-level tool invocation behaviors in real time and proactively intervening before unsafe execution is critical for agent deployment, yet remains under-explored. In this work, we first construct TS-Bench, a novel benchmark for step-level tool invocation safety detection in LLM agents. We then develop a guardrail model, TS-Guard, using multi-task reinforcement learning. The model proactively detects unsafe tool invocation actions before execution by reasoning over the interaction history. It assesses request harmfulness and action-attack correlations, producing interpretable and generalizable safety judgments and feedback. Furthermore, we introduce TS-Flow, a guardrail-feedback-driven reasoning framework for LLM agents, which reduces harmful tool invocations of ReAct-style agents by 65 percent on average and improves benign task completion by approximately 10 percent under prompt injection attacks.


Key Contributions

  • TS-Bench: first benchmark for step-level tool invocation safety detection covering malicious requests and prompt injection scenarios
  • TS-Guard: multi-task RL-trained guardrail model that assesses request harmfulness and action–attack correlations before tool execution
  • TS-Flow: guardrail-feedback-driven reasoning framework that reduces harmful tool invocations by 65% and improves benign task completion by ~10% under prompt injection attacks
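The check-before-execute loop described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: `guardrail_check` substitutes a simple keyword heuristic for TS-Guard's learned, history-conditioned judgment, and all names (`ToolCall`, `run_step`, `SENSITIVE_TOOLS`, the injection markers) are hypothetical. What it does show is the TS-Flow control flow: the guardrail inspects the proposed action against the interaction history before execution, and on a block its feedback is routed back to the agent as an observation instead of running the tool.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str   # tool the agent wants to invoke
    args: dict  # proposed arguments

@dataclass
class Verdict:
    safe: bool
    feedback: str  # explanation returned to the agent when blocked

# Illustrative heuristics standing in for a trained guardrail model's
# assessment of request harmfulness and action-attack correlation.
INJECTION_MARKERS = ("ignore previous instructions", "you must now")
SENSITIVE_TOOLS = {"delete_file", "send_email", "transfer_funds"}

def guardrail_check(history: list[dict], call: ToolCall) -> Verdict:
    """Toy TS-Guard-style check, run BEFORE the tool executes."""
    injected = any(
        msg["role"] == "tool"
        and any(m in msg["content"].lower() for m in INJECTION_MARKERS)
        for msg in history
    )
    if injected and call.name in SENSITIVE_TOOLS:
        return Verdict(False, f"Blocked '{call.name}': the proposed action "
                              "correlates with injected instructions found "
                              "in a prior tool observation.")
    return Verdict(True, "")

def run_step(call: ToolCall, history: list[dict], execute_tool):
    """One guarded agent step: check first, execute only if judged safe."""
    verdict = guardrail_check(history, call)
    if not verdict.safe:
        # TS-Flow-style feedback: append the guardrail's explanation to the
        # interaction history so the agent can re-plan, instead of executing.
        history.append({"role": "guardrail", "content": verdict.feedback})
        return None
    result = execute_tool(call)
    history.append({"role": "tool", "content": result})
    return result
```

A blocked step returns `None` and leaves the guardrail's feedback as the last history entry; a safe step executes normally and appends the tool's observation, so the agent loop proceeds unchanged in the benign case.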

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time, black_box
Datasets
TS-Bench
Applications
llm agents, react-style agents, tool-calling agents