
Published on arXiv

2603.02635

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

On Qwen2.5-VL-7B, SaFeR-ToolKit raises safety from 53.21 to 86.34 and helpfulness from 52.92 to 80.79 while general capability (66.39→66.81) is nearly unchanged.

SaFeR-ToolKit

Novel technique introduced


Vision-language models remain susceptible to multimodal jailbreaks and over-refusal because safety hinges on both visual evidence and user intent, while many alignment pipelines supervise only the final response. To address this, we present SaFeR-ToolKit, which formalizes safety decision-making as a checkable protocol. Concretely, a planner specifies a persona, a Perception $\to$ Reasoning $\to$ Decision tool set, and a constrained transition graph, while a responder outputs a typed key-value tool trace before the final answer. To make the protocol reliably followed in practice, we train a single policy with a three-stage curriculum (SFT $\to$ DPO $\to$ GRPO), where GRPO directly supervises tool usage beyond answer-level feedback. Our contributions are two-fold. I. Dataset: the first tool-based safety reasoning dataset, comprising 31,654 examples (SFT 6k, DPO 18.6k, GRPO 6k) plus a 1k held-out evaluation set. II. Experiments: on Qwen2.5-VL, SaFeR-ToolKit significantly improves Safety/Helpfulness/Reasoning Rigor on 3B (29.39/45.04/4.98 $\to$ 84.40/71.13/78.87) and 7B (53.21/52.92/19.26 $\to$ 86.34/80.79/85.34), while preserving general capabilities (3B: 58.67 $\to$ 59.21; 7B: 66.39 $\to$ 66.81). Code is available at https://github.com/Duebassx/SaFeR_ToolKit.
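The "checkable protocol" idea above can be sketched concretely: a responder emits a typed key-value tool trace, and a validator rejects any trace that violates the constrained Perception→Reasoning→Decision transition graph. The stage names follow the abstract, but the example tools, trace schema, and validator below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a checkable safety protocol: a tool trace
# (list of typed key-value steps) is validated against a constrained
# Perception -> Reasoning -> Decision transition graph.

# Allowed stage transitions: each stage may only be followed by the listed stages.
TRANSITIONS = {
    "Perception": {"Perception", "Reasoning"},
    "Reasoning": {"Reasoning", "Decision"},
    "Decision": set(),  # Decision is terminal
}

def validate_trace(trace):
    """Check that a tool trace respects the transition graph.

    `trace` is a list of (stage, tool, value) triples emitted before the
    final answer. Returns (ok, reason).
    """
    if not trace:
        return False, "empty trace"
    if trace[0][0] != "Perception":
        return False, "trace must start with Perception"
    if trace[-1][0] != "Decision":
        return False, "trace must end with a Decision"
    for (prev, _, _), (cur, tool, _) in zip(trace, trace[1:]):
        if cur not in TRANSITIONS[prev]:
            return False, f"illegal transition {prev} -> {cur} at tool {tool!r}"
    return True, "ok"

# Example: a well-formed trace for an image-grounded safety decision
# (tool names and values are made up for illustration).
trace = [
    ("Perception", "describe_image", "a kitchen knife on a cutting board"),
    ("Reasoning", "assess_intent", "user asks about safe food preparation"),
    ("Decision", "final_verdict", "safe: answer helpfully"),
]
ok, reason = validate_trace(trace)
```

Because the trace is structured data rather than free-form rationale, this kind of check can run automatically during training (e.g. as part of a GRPO reward on tool usage) or at inference time as an audit step.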


Key Contributions

  • SaFeR-ToolKit: a structured safety protocol that formalizes VLM safety decisions as an auditable Perception→Reasoning→Decision tool trace, making jailbreak resistance checkable and traceable
  • First tool-based safety reasoning dataset (31,654 examples across SFT/DPO/GRPO stages) plus 1k held-out evaluation set
  • Three-stage curriculum training (SFT→DPO→GRPO) that directly supervises intermediate tool usage beyond answer-level feedback, achieving dramatic safety gains (29.39→84.40 on 3B) while preserving general capabilities

🛡️ Threat Analysis


Details

Domains
multimodal, vision, nlp
Model Types
vlm
Threat Tags
inference_time
Datasets
Custom SaFeR-ToolKit dataset (31,654 training + 1k eval), MMSafety, FigStep
Applications
vision-language model safety, multimodal content moderation, AI assistant alignment