defense 2025

Building a Foundational Guardrail for General Agentic Systems via Synthetic Data

Yue Huang 1, Hang Hua 1,2, Yujun Zhou 1, Pengcheng Jing 1, Manish Nagireddy 3,4, Inkit Padhi 4, Greta Dolcetti 1, Zhangchen Xu 5, Subhajit Chaudhury 4, Ambrish Rawat 4, Liubov Nedoshivina 4, Pin-Yu Chen 4, Prasanna Sattigeri 2,4, Xiangliang Zhang 1

5 citations · 1 influential · arXiv


Published on arXiv: 2510.09781

Excessive Agency (OWASP LLM Top 10: LLM08)

Key Finding

Safiron consistently outperforms both open-weight and proprietary baselines on Pre-Exec Bench across detection accuracy, fine-grained risk categorization, and interpretability, while preserving task success rates.

Safiron

Novel technique introduced


While LLM agents can plan multi-step tasks, intervening at the planning stage, before any action is executed, is often the safest way to prevent harm, since certain risks can lead to severe consequences once carried out. However, existing guardrails mostly operate post-execution, which is difficult to scale and leaves little room for controllable supervision at the plan level. To address this challenge, we highlight three critical gaps in current research: data gap, model gap, and evaluation gap. To close the data gap, we introduce AuraGen, a controllable engine that (i) synthesizes benign trajectories, (ii) injects category-labeled risks with calibrated difficulty, and (iii) filters outputs via an automated reward model, producing large and reliable corpora for pre-execution safety. To close the guardian model gap, we propose a foundational guardrail Safiron, combining a cross-planner adapter with a compact guardian model. The adapter unifies different input formats, while Safiron flags risky cases, assigns risk types, and generates rationales; trained in two stages with a broadly explored data recipe, Safiron achieves robust transfer across settings. To close the evaluation gap, we release Pre-Exec Bench, a realistic benchmark covering diverse tools and branching trajectories, which measures detection, fine-grained categorization, explanation, and cross-planner generalization in human-verified scenarios. Extensive experiments demonstrate consistent gains of the proposed guardrail over strong baselines on Pre-Exec Bench, and ablations further distill actionable practices, providing a practical template for safer agentic systems.
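The AuraGen pipeline described above has three stages: synthesize a benign trajectory, inject a category-labeled risk at a calibrated difficulty, and filter the result with an automated reward model. A minimal sketch of that flow, with all function and field names hypothetical (the paper does not publish an API):

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Trajectory:
    steps: List[str]                    # planned tool calls for a multi-step task
    risk_category: Optional[str] = None # None means the trajectory is benign
    difficulty: float = 0.0             # calibrated injection difficulty in [0, 1]

def synthesize_benign(task: str) -> Trajectory:
    """Stage (i): produce a benign multi-step plan for a task (stub)."""
    return Trajectory(steps=[f"plan_step_for({task!r})"])

def inject_risk(traj: Trajectory, category: str, difficulty: float) -> Trajectory:
    """Stage (ii): add a category-labeled risky step at the chosen difficulty."""
    return Trajectory(steps=traj.steps + [f"risky_step[{category}]"],
                      risk_category=category, difficulty=difficulty)

def reward_filter(trajs: List[Trajectory],
                  score_fn: Callable[[Trajectory], float],
                  threshold: float = 0.5) -> List[Trajectory]:
    """Stage (iii): keep only trajectories the reward model scores above threshold."""
    return [t for t in trajs if score_fn(t) >= threshold]
```

Here `score_fn` stands in for the automated reward model; in the paper it is a learned filter, not a hand-written rule.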


Key Contributions

  • AuraGen: a controllable synthetic data engine that generates diverse benign agent trajectories and injects category-labeled risks at calibrated difficulty, filtered by an automated reward model
  • Safiron: a compact guardian model trained with two-stage SFT + GRPO reinforcement that outputs a binary harmless/risky decision, a fine-grained risk category, and a concise explanation before plan execution
  • Pre-Exec Bench: a human-verified benchmark measuring detection, fine-grained risk categorization, explanation quality, and cross-planner generalization for pre-execution agent safety
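Per the contributions above, Safiron emits three things before any plan step runs: a binary harmless/risky decision, a fine-grained risk category, and a concise explanation. A sketch of that output contract, with the schema, category names, and `classify` callable all hypothetical stand-ins for the actual model:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Illustrative categories only; the paper defines its own fine-grained taxonomy.
RISK_CATEGORIES = {"privacy_leak", "financial_harm", "unauthorized_action"}

@dataclass
class GuardrailVerdict:
    risky: bool               # binary harmless/risky decision
    category: Optional[str]   # fine-grained risk type; None when harmless
    rationale: str            # concise explanation for the decision

def pre_exec_check(plan_steps: List[str],
                   classify: Callable[[List[str]], Tuple[bool, Optional[str], str]]
                   ) -> GuardrailVerdict:
    """Run the guardian over a normalized plan before execution.

    `classify` stands in for the Safiron model call; the cross-planner
    adapter would normalize planner-specific formats into `plan_steps`.
    """
    risky, category, rationale = classify(plan_steps)
    if risky and category not in RISK_CATEGORIES:
        raise ValueError(f"unknown risk category: {category}")
    return GuardrailVerdict(risky, category, rationale)
```

The point of the pre-execution placement is that a risky verdict here blocks the plan before any tool call fires, rather than auditing effects after the fact.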

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Datasets
Pre-Exec Bench
Applications
llm agent safety, agentic ai systems, pre-execution risk detection, tool-use monitoring