defense arXiv Jan 15, 2026 · 11w ago
Yutao Mou, Zhangchi Xue, Lijun Li et al. · Peking University · Shanghai Artificial Intelligence Laboratory
Proactive step-level guardrail for LLM agent tool calls defends against malicious requests and prompt injection, cutting harmful invocations by 65%
Insecure Plugin Design Prompt Injection nlp
While LLM-based agents can interact with environments via invoking external tools, their expanded capabilities also amplify security risks. Monitoring step-level tool invocation behaviors in real time and proactively intervening before unsafe execution is critical for agent deployment, yet remains under-explored. In this work, we first construct TS-Bench, a novel benchmark for step-level tool invocation safety detection in LLM agents. We then develop a guardrail model, TS-Guard, using multi-task reinforcement learning. The model proactively detects unsafe tool invocation actions before execution by reasoning over the interaction history. It assesses request harmfulness and action-attack correlations, producing interpretable and generalizable safety judgments and feedback. Furthermore, we introduce TS-Flow, a guardrail-feedback-driven reasoning framework for LLM agents, which reduces harmful tool invocations of ReAct-style agents by 65 percent on average and improves benign task completion by approximately 10 percent under prompt injection attacks.
llm Peking University · Shanghai Artificial Intelligence Laboratory
attack arXiv Oct 9, 2025 · Oct 2025
Muxi Diao, Yutao Mou, Keqing He et al. · Beijing University of Posts and Telecommunications · Peking University +1 more
Seed-free LLM red teaming framework using persona-guided generation and reflection loops to produce diverse, high-ASR jailbreak prompts
Prompt Injection nlp
The safety of Large Language Models (LLMs) is crucial for the development of trustworthy AI applications. Existing red teaming methods often rely on seed instructions, which limits the semantic diversity of the synthesized adversarial prompts. We propose AutoRed, a free-form adversarial prompt generation framework that removes the need for seed instructions. AutoRed operates in two stages: (1) persona-guided adversarial instruction generation, and (2) a reflection loop to iteratively refine low-quality prompts. To improve efficiency, we introduce a verifier to assess prompt harmfulness without querying the target models. Using AutoRed, we build two red teaming datasets -- AutoRed-Medium and AutoRed-Hard -- and evaluate eight state-of-the-art LLMs. AutoRed achieves higher attack success rates and better generalization than existing baselines. Our results highlight the limitations of seed-based approaches and demonstrate the potential of free-form red teaming for LLM safety evaluation. We will open source our datasets in the near future.
llm Beijing University of Posts and Telecommunications · Peking University · Beijing Academy of Artificial Intelligence