OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL
Jinjie Shen 1,2, Jing Wu 1, Yaxiong Wang 1, Lechao Cheng 1, Shengeng Tang 1, Tianrui Hui 1, Nan Pu 1, Zhun Zhong 1
Published on arXiv
2602.10687
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
OmniVL-Guard significantly outperforms state-of-the-art MLLMs (GPT-5, Gemini3, Seed1.6) on both binary classification and localization tasks, with strong zero-shot generalization to out-of-domain scenarios.
ARSPO (Adaptive Reward Scaling Policy Optimization)
Novel technique introduced
Existing forgery detection methods are often limited to uni-modal or bi-modal settings, failing to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. In this unified setting, the interplay between diverse modalities and the dual requirements of simultaneous detection and localization pose a critical "difficulty bias" problem: the simpler veracity classification task tends to dominate the gradients, leading to suboptimal performance in fine-grained grounding during multi-task optimization. To address this challenge, we propose OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding. OmniVL-Guard comprises two core designs: Self-Evolving CoT Generation and Adaptive Reward Scaling Policy Optimization (ARSPO). Self-Evolving CoT Generation synthesizes high-quality reasoning paths, effectively overcoming the cold-start challenge. Building upon this, ARSPO dynamically modulates reward scales and task weights, ensuring balanced joint optimization. Extensive experiments demonstrate that OmniVL-Guard significantly outperforms state-of-the-art methods and exhibits robust zero-shot generalization across out-of-domain scenarios.
Key Contributions
- Unified omnibus framework for vision-language forgery detection and grounding across interleaved text, images, and videos
- Self-Evolving CoT Generation strategy to synthesize high-quality reasoning paths for cold-start bootstrapping
- Adaptive Reward Scaling Policy Optimization (ARSPO) that dynamically modulates reward scales and task weights to address difficulty bias between classification and grounding tasks
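The paper does not include an implementation here, but the core idea behind ARSPO — reweighting per-task rewards so the easy classification task cannot dominate the policy gradient over the harder grounding task — can be sketched as below. This is a minimal illustration under our own assumptions (inverse-difficulty weighting using mean reward as the difficulty proxy); all function and variable names are hypothetical, not from the paper.

```python
import numpy as np

def scale_task_rewards(rewards_cls, rewards_grd, eps=1e-8):
    """Hypothetical sketch of difficulty-aware reward scaling.

    Scales each task's rewards inversely to its mean reward so that
    the task the policy already handles well (here: binary veracity
    classification) does not dominate the gradient over fine-grained
    grounding during joint multi-task optimization.
    """
    rewards_cls = np.asarray(rewards_cls, dtype=float)
    rewards_grd = np.asarray(rewards_grd, dtype=float)

    # Mean reward per task acts as a proxy for task difficulty:
    # a high mean means the task is already easy for the policy.
    mean_cls = rewards_cls.mean()
    mean_grd = rewards_grd.mean()

    # Inverse-difficulty weights, normalized to sum to 1, so the
    # lagging task receives the larger share of the learning signal.
    w_cls = 1.0 / (mean_cls + eps)
    w_grd = 1.0 / (mean_grd + eps)
    total = w_cls + w_grd
    w_cls, w_grd = w_cls / total, w_grd / total

    return w_cls * rewards_cls, w_grd * rewards_grd

# Example: classification is near-saturated (mean 0.9) while
# grounding lags (mean 0.3); scaling equalizes their contributions.
scaled_cls, scaled_grd = scale_task_rewards([0.9, 0.95, 0.85], [0.2, 0.4, 0.3])
```

With inverse-mean weights, the two tasks end up contributing equal average reward mass, which is one simple way to realize the "balanced joint optimization" the paper describes; the actual ARSPO mechanism may differ.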
🛡️ Threat Analysis
The paper's primary contribution is a system for detecting and localizing forged/manipulated content across multiple modalities (text, images, video). Deepfake detection and multimodal content-authenticity verification fall squarely under Output Integrity Attack (AI-generated content detection).