PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection
Tuan T. Nguyen 1, Naseem Khan 1, Khang Tran 2, NhatHai Phan 2, Issa Khalil 1
Published on arXiv (arXiv:2509.26272)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
PRPO achieves the highest reasoning score of 4.55/5.0 and significantly outperforms GRPO in both deepfake detection accuracy and explanation quality.
PRPO (Paragraph-level Relative Policy Optimization)
Novel technique introduced
The rapid rise of synthetic media has made deepfake detection a critical challenge for online safety and trust. Progress remains constrained by the scarcity of large, high-quality datasets. Although multimodal large language models (LLMs) exhibit strong reasoning capabilities, their performance on deepfake detection is poor, often producing explanations that are misaligned with visual evidence or hallucinatory. To address this limitation, we introduce a reasoning-annotated dataset for deepfake detection and propose Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns LLM reasoning with image content at the paragraph level. Experiments show that PRPO improves detection accuracy by a wide margin and achieves the highest reasoning score of 4.55/5.0. Ablation studies further demonstrate that PRPO significantly outperforms GRPO under test-time conditions. These results underscore the importance of grounding multimodal reasoning in visual evidence to enable more reliable and interpretable deepfake detection.
Key Contributions
- Reasoning-annotated dataset for deepfake detection that pairs synthetic images with grounded visual explanations
- PRPO (Paragraph-level Relative Policy Optimization), an RL algorithm that aligns VLM reasoning with image content at the paragraph level
- Empirical demonstration that PRPO significantly outperforms GRPO under test-time conditions, achieving a reasoning score of 4.55/5.0
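The summary does not include the algorithm's details, but the name suggests a GRPO-style group-relative advantage computed per paragraph of the generated explanation rather than per whole response. A minimal sketch under that assumption (all function names are illustrative, not from the paper):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style normalization: score each reward against the
    mean and standard deviation of its sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

def paragraph_level_advantages(paragraph_rewards):
    """Hypothetical PRPO-style variant: given a group of sampled
    responses, each scored paragraph by paragraph, normalize each
    paragraph position across the group so credit is assigned to
    individual paragraphs instead of the whole response.

    paragraph_rewards[i][p] = reward for paragraph p of response i.
    Returns advantages with the same [response][paragraph] layout.
    """
    num_paragraphs = len(paragraph_rewards[0])
    per_paragraph = []
    for p in range(num_paragraphs):
        group = [resp[p] for resp in paragraph_rewards]
        per_paragraph.append(group_relative_advantages(group))
    # transpose back to [response][paragraph]
    return [list(adv) for adv in zip(*per_paragraph)]
```

The intuition matching the abstract: a response whose first paragraph is well grounded in the image but whose second paragraph hallucinates would receive a positive advantage on the first paragraph and a negative one on the second, rather than a single blended response-level signal.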
🛡️ Threat Analysis
Deepfake detection is explicitly an ML09 concern (AI-generated content detection / output integrity). The paper proposes PRPO, a novel forensic detection method: rather than merely applying an existing detector to a new domain, it introduces a new RL-based training paradigm that grounds multimodal reasoning in visual evidence for synthetic-media detection.