GuardReasoner-Omni: A Reasoning-based Multi-modal Guardrail for Text, Image, and Video
Zhenhao Zhu 1,2, Yue Liu 2, Yanpei Guo 2, Wenjie Qu 2, Cancan Chen 3, Yufei He 2, Yibo Li 2, Yulin Chen 2, Tianyi Wu 2, Huiying Xu 4, Xinzhong Zhu 4, Jiaheng Zhang 2
Published on arXiv
arXiv:2602.03328
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
GuardReasoner-Omni (2B) surpasses the runner-up by 5.3% F1 on multi-modal guardrail benchmarks while existing guardrails like LLaMA Guard 4 suffer a 19.7% F1 collapse on video.
GuardReasoner-Omni
Novel technique introduced
We present GuardReasoner-Omni, a reasoning-based guardrail model designed to moderate text, image, and video data. First, we construct a comprehensive training corpus comprising 148k samples spanning these three modalities. Our training pipeline follows a two-stage paradigm that incentivizes the model to deliberate before making decisions: (1) SFT to cold-start the model with explicit reasoning capabilities and structural adherence; and (2) RL with an error-driven exploration reward that incentivizes deeper reasoning on hard samples. We release a suite of models at 2B and 4B parameter scales. Extensive experiments demonstrate that GuardReasoner-Omni outperforms existing state-of-the-art baselines across various guardrail benchmarks. Notably, GuardReasoner-Omni (2B) surpasses the runner-up by 5.3% in F1 score.
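The RL stage described above rewards correct moderation verdicts and adds an extra incentive on hard samples. The paper does not spell out the reward formula here, so the following is a minimal illustrative sketch under assumed components: a small bonus for well-formed structured output, a base reward for a correct verdict, and an exploration bonus when the sample was previously misclassified. All names and weights are hypothetical.

```python
def guardrail_reward(pred_label: str,
                     gold_label: str,
                     format_ok: bool,
                     is_hard_sample: bool,
                     exploration_bonus: float = 0.5) -> float:
    """Scalar reward for one rollout in the RL stage (illustrative sketch).

    Assumed components, not the paper's exact formula:
      - format_ok: the structured output (reasoning + verdict) parsed cleanly
      - is_hard_sample: the SFT model previously got this sample wrong
    """
    reward = 0.0
    if format_ok:                    # small reward for structural adherence
        reward += 0.1
    if pred_label == gold_label:     # base reward for a correct verdict
        reward += 1.0
        if is_hard_sample:           # error-driven bonus: correct answers on
            reward += exploration_bonus  # previously-failed samples pay extra
    return reward
```

Under this scheme, easy samples cap out at 1.1 while hard samples can reach 1.6, steering the policy toward deeper reasoning where the cold-start model failed.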
Key Contributions
- GuardReasoner-OmniTrain-148K: a large-scale multimodal safety dataset covering text, image, and video modalities
- Two-stage training pipeline combining SFT cold-start with GRPO reinforcement learning using hard sample mining and error-driven exploration reward
- First reasoning-based guardrail achieving state-of-the-art performance across all three modalities, outperforming runner-up by 5.3% F1
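The GRPO objective named in the contributions computes advantages relative to a group of rollouts for the same prompt, rather than using a learned value function. A minimal sketch of that group-relative normalization (the grouping and normalization are standard GRPO; how the paper combines it with hard sample mining is not detailed here):

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each rollout's reward by the
    mean and std of all G rollouts sampled for the same prompt."""
    mu = statistics.mean(group_rewards)
    sd = statistics.pstdev(group_rewards)
    if sd == 0.0:                 # all rollouts tied: no learning signal
        return [0.0] * len(group_rewards)
    return [(r - mu) / sd for r in group_rewards]
```

Hard sample mining would then bias which prompts fill these groups toward ones the cold-started model answers incorrectly, so nonzero advantages concentrate where reasoning most needs improvement.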