GuardReasoner-Omni: A Reasoning-based Multi-modal Guardrail for Text, Image, and Video
Zhenhao Zhu 1,2, Yue Liu 2, Yanpei Guo 2, Wenjie Qu 2, Cancan Chen 3, Yufei He 2, Yibo Li 2, Yulin Chen 2, Tianyi Wu 2, Huiying Xu 4, Xinzhong Zhu 4, Jiaheng Zhang 2
Published on arXiv
arXiv:2602.03328
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
GuardReasoner-Omni (2B) surpasses the runner-up by 5.3% F1 on multi-modal guardrail benchmarks while existing guardrails like LLaMA Guard 4 suffer a 19.7% F1 collapse on video.
GuardReasoner-Omni
Novel technique introduced
We present GuardReasoner-Omni, a reasoning-based guardrail model designed to moderate text, image, and video data. First, we construct a comprehensive training corpus comprising 148k samples spanning these three modalities. Our training pipeline follows a two-stage paradigm that incentivizes the model to deliberate before making decisions: (1) SFT to cold-start the model with explicit reasoning capabilities and structural adherence; and (2) RL with an error-driven exploration reward that incentivizes deeper reasoning on hard samples. We release a suite of models at 2B and 4B parameter scales. Extensive experiments demonstrate that GuardReasoner-Omni outperforms existing state-of-the-art baselines across various guardrail benchmarks. Notably, GuardReasoner-Omni (2B) surpasses the runner-up by 5.3% in F1 score.
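The RL stage described above rewards correct moderation verdicts and adds an extra incentive on hard samples. The paper does not spell out the reward formula here, so the following is a minimal illustrative sketch under assumed components: a small bonus for well-formed structured output, a base reward for a correct verdict, and an exploration bonus when the sample was previously misclassified. All names and weights are hypothetical.

```python
def guardrail_reward(pred_label: str,
                     gold_label: str,
                     format_ok: bool,
                     is_hard_sample: bool,
                     exploration_bonus: float = 0.5) -> float:
    """Scalar reward for one rollout in the RL stage (illustrative sketch).

    Assumed components, not the paper's exact formula:
      - format_ok: the structured output (reasoning + verdict) parsed cleanly
      - is_hard_sample: the SFT model previously got this sample wrong
    """
    reward = 0.0
    if format_ok:                    # small reward for structural adherence
        reward += 0.1
    if pred_label == gold_label:     # base reward for a correct verdict
        reward += 1.0
        if is_hard_sample:           # error-driven bonus: correct answers on
            reward += exploration_bonus  # previously-failed samples pay extra
    return reward
```

Under this scheme, easy samples cap out at 1.1 while hard samples can reach 1.6, steering the policy toward deeper reasoning where the cold-start model failed.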
Key Contributions
- GuardReasoner-OmniTrain-148K: a large-scale multimodal safety dataset covering text, image, and video modalities
- Two-stage training pipeline combining SFT cold-start with GRPO reinforcement learning using hard sample mining and error-driven exploration reward
- First reasoning-based guardrail achieving state-of-the-art performance across all three modalities, outperforming runner-up by 5.3% F1
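The GRPO objective named in the contributions computes advantages relative to a group of rollouts for the same prompt, rather than using a learned value function. A minimal sketch of that group-relative normalization (the grouping and normalization are standard GRPO; how the paper combines it with hard sample mining is not detailed here):

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each rollout's reward by the
    mean and std of all G rollouts sampled for the same prompt."""
    mu = statistics.mean(group_rewards)
    sd = statistics.pstdev(group_rewards)
    if sd == 0.0:                 # all rollouts tied: no learning signal
        return [0.0] * len(group_rewards)
    return [(r - mu) / sd for r in group_rewards]
```

Hard sample mining would then bias which prompts fill these groups toward ones the cold-started model answers incorrectly, so nonzero advantages concentrate where reasoning most needs improvement.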