defense 2025

Building a Foundational Guardrail for General Agentic Systems via Synthetic Data

Yue Huang 1, Hang Hua 1,2, Yujun Zhou 1, Pengcheng Jing 1, Manish Nagireddy 3,4, Inkit Padhi 4, Greta Dolcetti 1, Zhangchen Xu 5, Subhajit Chaudhury 4, Ambrish Rawat 4, Liubov Nedoshivina 4, Pin-Yu Chen 4, Prasanna Sattigeri 2,4, Xiangliang Zhang 1

5 citations · 1 influential · arXiv


Published on arXiv: 2510.09781

Excessive Agency (OWASP LLM Top 10: LLM08)

Key Finding

Safiron consistently outperforms both open-weight and proprietary baselines on Pre-Exec Bench across detection accuracy, fine-grained risk categorization, and interpretability, while preserving task success rates.

Safiron

Novel technique introduced


While LLM agents can plan multi-step tasks, intervening at the planning stage, before any action is executed, is often the safest way to prevent harm, since certain risks can lead to severe consequences once carried out. However, existing guardrails mostly operate post-execution, which is difficult to scale and leaves little room for controllable supervision at the plan level. To address this challenge, we highlight three critical gaps in current research: data gap, model gap, and evaluation gap. To close the data gap, we introduce AuraGen, a controllable engine that (i) synthesizes benign trajectories, (ii) injects category-labeled risks with calibrated difficulty, and (iii) filters outputs via an automated reward model, producing large and reliable corpora for pre-execution safety. To close the guardian model gap, we propose a foundational guardrail Safiron, combining a cross-planner adapter with a compact guardian model. The adapter unifies different input formats, while Safiron flags risky cases, assigns risk types, and generates rationales; trained in two stages with a broadly explored data recipe, Safiron achieves robust transfer across settings. To close the evaluation gap, we release Pre-Exec Bench, a realistic benchmark covering diverse tools and branching trajectories, which measures detection, fine-grained categorization, explanation, and cross-planner generalization in human-verified scenarios. Extensive experiments demonstrate consistent gains of the proposed guardrail over strong baselines on Pre-Exec Bench, and ablations further distill actionable practices, providing a practical template for safer agentic systems.
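The AuraGen pipeline described above has three stages: synthesize a benign trajectory, inject a category-labeled risk at a calibrated difficulty, and filter the result with an automated reward model. A minimal sketch of that flow, with all function and field names hypothetical (the paper does not publish an API):

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Trajectory:
    steps: List[str]                    # planned tool calls for a multi-step task
    risk_category: Optional[str] = None # None means the trajectory is benign
    difficulty: float = 0.0             # calibrated injection difficulty in [0, 1]

def synthesize_benign(task: str) -> Trajectory:
    """Stage (i): produce a benign multi-step plan for a task (stub)."""
    return Trajectory(steps=[f"plan_step_for({task!r})"])

def inject_risk(traj: Trajectory, category: str, difficulty: float) -> Trajectory:
    """Stage (ii): add a category-labeled risky step at the chosen difficulty."""
    return Trajectory(steps=traj.steps + [f"risky_step[{category}]"],
                      risk_category=category, difficulty=difficulty)

def reward_filter(trajs: List[Trajectory],
                  score_fn: Callable[[Trajectory], float],
                  threshold: float = 0.5) -> List[Trajectory]:
    """Stage (iii): keep only trajectories the reward model scores above threshold."""
    return [t for t in trajs if score_fn(t) >= threshold]
```

Here `score_fn` stands in for the automated reward model; in the paper it is a learned filter, not a hand-written rule.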


Key Contributions

  • AuraGen: a controllable synthetic data engine that generates diverse benign agent trajectories and injects category-labeled risks at calibrated difficulty, filtered by an automated reward model
  • Safiron: a compact guardian model trained with two-stage SFT + GRPO reinforcement that outputs a binary harmless/risky decision, a fine-grained risk category, and a concise explanation before plan execution
  • Pre-Exec Bench: a human-verified benchmark measuring detection, fine-grained risk categorization, explanation quality, and cross-planner generalization for pre-execution agent safety
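Per the contributions above, Safiron emits three things before any plan step runs: a binary harmless/risky decision, a fine-grained risk category, and a concise explanation. A sketch of that output contract, with the schema, category names, and `classify` callable all hypothetical stand-ins for the actual model:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Illustrative categories only; the paper defines its own fine-grained taxonomy.
RISK_CATEGORIES = {"privacy_leak", "financial_harm", "unauthorized_action"}

@dataclass
class GuardrailVerdict:
    risky: bool               # binary harmless/risky decision
    category: Optional[str]   # fine-grained risk type; None when harmless
    rationale: str            # concise explanation for the decision

def pre_exec_check(plan_steps: List[str],
                   classify: Callable[[List[str]], Tuple[bool, Optional[str], str]]
                   ) -> GuardrailVerdict:
    """Run the guardian over a normalized plan before execution.

    `classify` stands in for the Safiron model call; the cross-planner
    adapter would normalize planner-specific formats into `plan_steps`.
    """
    risky, category, rationale = classify(plan_steps)
    if risky and category not in RISK_CATEGORIES:
        raise ValueError(f"unknown risk category: {category}")
    return GuardrailVerdict(risky, category, rationale)
```

The point of the pre-execution placement is that a risky verdict here blocks the plan before any tool call fires, rather than auditing effects after the fact.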

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Datasets
Pre-Exec Bench
Applications
llm agent safety, agentic ai systems, pre-execution risk detection, tool-use monitoring