defense 2026

Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Awadallah

Microsoft Research

0 citations

Published on arXiv

2603.03205

Prompt Injection

OWASP LLM Top 10 — LLM01

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

MOSAIC reduces harmful agentic behavior by up to 50%, increases refusal on prompt injection attacks by over 20%, and cuts cross-domain privacy leakage while preserving benign task performance across Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4.

MOSAIC

Novel technique introduced

Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act or refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions often missed by scalar rewards. We evaluate MOSAIC zero-shot across three model families, Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4, and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.
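The "plan, check, then act or refuse" loop described above can be illustrated with a small sketch. This is not the MOSAIC implementation: in the paper, planning and safety reasoning are produced by the trained model, whereas here the plan is given as a fixed list and the check is a hypothetical keyword filter (`BLOCKED_MARKERS`). The names `Step`, `Trace`, `check`, and `run_agent` are all assumptions made for illustration. The sketch only shows the control flow, in which refusal is an ordinary outcome of each step and earlier tool outputs feed into later checks.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    tool: str       # name of the tool the agent proposes to call
    argument: str   # argument passed to that tool


@dataclass
class Trace:
    actions: List[str] = field(default_factory=list)  # tools actually called
    refused: bool = False
    reason: str = ""


# Hypothetical stand-in for learned safety reasoning (an assumption,
# not the paper's method): a simple list of suspicious phrases.
BLOCKED_MARKERS = ("password", "ignore previous instructions")


def check(step: Step, observations: List[str]) -> str:
    """Return 'act' or 'refuse' for a proposed step.

    The check sees both the proposed call and all earlier tool outputs,
    so adversarial tool feedback can trigger a refusal.
    """
    text = " ".join([step.argument] + observations).lower()
    return "refuse" if any(marker in text for marker in BLOCKED_MARKERS) else "act"


def run_agent(plan: List[Step], tools: Dict[str, Callable[[str], str]]) -> Trace:
    """Execute a plan step by step, checking before every tool call."""
    trace = Trace()
    observations: List[str] = []
    for step in plan:
        if check(step, observations) == "refuse":
            # Refusal is a first-class action: record it and stop.
            trace.refused = True
            trace.reason = f"refused before calling {step.tool}"
            break
        observations.append(tools[step.tool](step.argument))
        trace.actions.append(step.tool)
    return trace
```

With a two-step plan (read a file, then send an email), a file whose contents include an injected instruction causes the loop to complete the first call and refuse the second, while a benign file lets both calls run.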

Key Contributions

MOSAIC post-training framework that structures agentic inference as a 'plan, check, then act or refuse' loop with refusal as a first-class action
Preference-based RL training using pairwise trajectory comparisons to capture safety distinctions missed by scalar rewards without trajectory-level labels
Zero-shot generalization across three model families and out-of-distribution benchmarks covering harmful tasks, prompt injection, and privacy leakage
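To make the idea of learning from pairwise trajectory comparisons concrete, here is a minimal sketch of a Bradley-Terry style preference loss. This is an assumption chosen for illustration, since the summary does not state the exact objective MOSAIC uses. The function names and the scalar trajectory scores are hypothetical. The point is that training only needs to know which of two trajectories is preferred, not an absolute label for either one.

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that trajectory A is preferred over B.

    The scores are hypothetical scalar values a model assigns to whole
    trajectories; only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def pairwise_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood of the observed preference.

    The loss is small when the preferred trajectory scores well above
    the rejected one, and large when the ordering is reversed.
    """
    return -math.log(preference_probability(score_preferred, score_rejected))
```

Minimizing this loss over many compared pairs pushes the score of the safer trajectory above the score of the less safe one. Equal scores give a probability of 0.5 and a loss of log 2.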

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

inference_time, training_time

Applications

agentic ai systems, multi-step tool use, llm agents

Similar Papers

Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening

The LLMbda Calculus: AI Agents, Conversations, and Information Flow

Agent-Sentry: Bounding LLM Agents via Execution Provenance

BlockA2A: Towards Secure and Verifiable Agent-to-Agent Interoperability

AgentSentinel: An End-to-End and Real-Time Security Defense Framework for Computer-Use Agents

A2AS: Agentic AI Runtime Security and Self-Defense

Securing AI Agents: Implementing Role-Based Access Control for Industrial Applications

Policy Compiler for Secure Agentic Systems