SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety
Zixuan Xu 1, Tiancheng He 2, Huahui Yi 3, Kun Wang 4, Xi Chen 3, Gongli Xi 2, Qiankun Li 4, Kang Li 3, Yang Liu 4, Zhigang Zeng 1
1 Huazhong University of Science and Technology
2 Beijing University of Posts and Telecommunications
Published on arXiv: 2603.02635
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
On Qwen2.5-VL-7B, SaFeR-ToolKit raises safety from 53.21 to 86.34 and helpfulness from 52.92 to 80.79 while general capability (66.39→66.81) is nearly unchanged.
SaFeR-ToolKit
Novel technique introduced
Vision-language models remain susceptible to multimodal jailbreaks and over-refusal because safety hinges on both visual evidence and user intent, while many alignment pipelines supervise only the final response. To address this, we present SaFeR-ToolKit, which formalizes safety decision-making as a checkable protocol. Concretely, a planner specifies a persona, a Perception $\to$ Reasoning $\to$ Decision tool set, and a constrained transition graph, while a responder outputs a typed key-value tool trace before the final answer. To ensure the protocol is followed reliably in practice, we train a single policy with a three-stage curriculum (SFT $\to$ DPO $\to$ GRPO), where GRPO directly supervises tool usage beyond answer-level feedback. Our contributions are two-fold: I. Dataset. The first tool-based safety reasoning dataset, comprising 31,654 examples (SFT 6k, DPO 18.6k, GRPO 6k) plus 1k held-out evaluation. II. Experiments. On Qwen2.5-VL, SaFeR-ToolKit significantly improves Safety/Helpfulness/Reasoning Rigor on 3B (29.39/45.04/4.98 $\to$ 84.40/71.13/78.87) and 7B (53.21/52.92/19.26 $\to$ 86.34/80.79/85.34), while preserving general capabilities (3B: 58.67 $\to$ 59.21; 7B: 66.39 $\to$ 66.81). Code is available at https://github.com/Duebassx/SaFeR_ToolKit.
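To make the idea of a "checkable protocol" concrete, the following is a minimal sketch of how a typed key-value tool trace could be validated against a constrained transition graph. It is an illustration only, not the authors' implementation: the `ToolCall` type, the `ALLOWED` graph, the `validate_trace` function, and the tool names in the example are all hypothetical. Only the Perception, Reasoning, Decision ordering comes from the abstract.

```python
from dataclasses import dataclass

# Stage names follow the Perception -> Reasoning -> Decision ordering in the abstract.
STAGES = ("perception", "reasoning", "decision")

# Assumed transition graph (hypothetical): a stage may repeat or advance
# to the next stage, and nothing may follow a decision.
ALLOWED = {
    "START": {"perception"},
    "perception": {"perception", "reasoning"},
    "reasoning": {"reasoning", "decision"},
    "decision": set(),
}


@dataclass(frozen=True)
class ToolCall:
    stage: str    # one of STAGES
    tool: str     # hypothetical tool name
    fields: dict  # key-value payload emitted by the responder


def validate_trace(trace):
    """Return (ok, reason) for a list of ToolCall objects."""
    state = "START"
    for i, call in enumerate(trace):
        if call.stage not in STAGES:
            return False, f"step {i}: unknown stage {call.stage!r}"
        if call.stage not in ALLOWED[state]:
            return False, f"step {i}: illegal transition {state} -> {call.stage}"
        if not all(isinstance(key, str) for key in call.fields):
            return False, f"step {i}: field keys must be strings"
        state = call.stage
    if state != "decision":
        return False, "trace does not end in a decision"
    return True, "ok"


# Example trace with made-up tool names and payloads.
example_trace = [
    ToolCall("perception", "describe_image", {"objects": ["kitchen knife"]}),
    ToolCall("reasoning", "assess_intent", {"intent": "cooking", "risk": "low"}),
    ToolCall("decision", "choose_action", {"action": "answer"}),
]
print(validate_trace(example_trace))
```

Because the check depends only on the trace structure, a trace that jumps from perception straight to a decision, or that never reaches a decision, is rejected before the final answer is considered.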
Key Contributions
- SaFeR-ToolKit: a structured safety protocol that formalizes VLM safety decisions as an auditable Perception→Reasoning→Decision tool trace, making jailbreak resistance checkable and traceable
- First tool-based safety reasoning dataset (31,654 examples across SFT/DPO/GRPO stages) plus 1k held-out evaluation set
- Three-stage curriculum training (SFT→DPO→GRPO) that directly supervises intermediate tool usage beyond answer-level feedback, raising safety from 29.39 to 84.40 on the 3B model while preserving general capabilities (58.67→59.21)
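For readers who want the size of each improvement at a glance, the short sketch below computes absolute gains from the before/after scores reported in the abstract. The dictionary layout and the `absolute_gains` helper are illustrative choices; only the numbers come from the source.

```python
# (before, after) scores as reported in the abstract for Qwen2.5-VL.
REPORTED = {
    "3B": {
        "safety": (29.39, 84.40),
        "helpfulness": (45.04, 71.13),
        "reasoning_rigor": (4.98, 78.87),
        "general_capability": (58.67, 59.21),
    },
    "7B": {
        "safety": (53.21, 86.34),
        "helpfulness": (52.92, 80.79),
        "reasoning_rigor": (19.26, 85.34),
        "general_capability": (66.39, 66.81),
    },
}


def absolute_gains(scores):
    """Map each metric to its (after - before) difference, rounded to 2 decimals."""
    return {metric: round(after - before, 2) for metric, (before, after) in scores.items()}


for model, scores in REPORTED.items():
    print(model, absolute_gains(scores))
```

On these figures, safety rises by 55.01 points on 3B and 33.13 points on 7B, while general capability moves by about half a point on either model.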