PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection

The rapid rise of synthetic media has made deepfake detection a critical challenge for online safety and trust. Progress remains constrained by the scarcity of large, high-quality datasets. Although multimodal large language models (LLMs) exhibit strong reasoning capabilities, they perform poorly on deepfake detection and often produce explanations that are hallucinatory or misaligned with the visual evidence. To address these limitations, we introduce a reasoning-annotated dataset for deepfake detection and propose Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns LLM reasoning with image content at the paragraph level. Experiments show that PRPO improves detection accuracy by a wide margin and achieves the highest reasoning score, 4.55/5.0. Ablation studies further demonstrate that PRPO significantly outperforms GRPO under test-time conditions. These results underscore the importance of grounding multimodal reasoning in visual evidence to enable more reliable and interpretable deepfake detection.
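For context on the GRPO baseline named above: GRPO samples a group of G responses o_1, ..., o_G for the same prompt q, scores each with a scalar reward r_i, and (in its outcome-reward form) assigns every token t of response i the same group-normalized advantage. The formula below is the standard GRPO definition, not PRPO's; this summary does not state how PRPO adapts it to the paragraph level.

```latex
\hat{A}_{i,t} = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})},
\qquad t = 1, \dots, |o_i|
```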

Key Contributions

Reasoning-annotated dataset for deepfake detection that pairs synthetic images with grounded visual explanations
PRPO (Paragraph-level Relative Policy Optimization), an RL algorithm that aligns VLM reasoning with image content at the paragraph level
Empirical results showing that PRPO significantly outperforms GRPO under test-time conditions and achieves the highest reasoning score reported, 4.55/5.0
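The contributions above describe PRPO only at a high level, so the sketch below is a hypothetical illustration of what a "paragraph-level relative" advantage could look like, not the paper's algorithm. It assumes that each sampled response is split into paragraphs at blank lines, that each paragraph receives a scalar reward from some visual-grounding judge (not specified here), and that these rewards are pooled across the group for GRPO-style mean/std normalization. The function names `split_paragraphs`, `group_relative`, and `paragraph_advantages` are illustrative only.

```python
import statistics
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Split a model response into non-empty paragraphs at blank lines (assumed delimiter)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def group_relative(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style normalization within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def paragraph_advantages(paragraph_rewards: List[List[float]]) -> List[List[float]]:
    """HYPOTHETICAL paragraph-level advantages, not the paper's definition.

    paragraph_rewards[i][j] is an assumed scalar reward for paragraph j of
    sampled response i. All paragraph rewards in the group are pooled,
    normalized together, and mapped back to their (response, paragraph) slots.
    """
    flat = [r for response in paragraph_rewards for r in response]
    normalized = group_relative(flat)
    advantages, k = [], 0
    for response in paragraph_rewards:
        advantages.append(normalized[k:k + len(response)])
        k += len(response)
    return advantages


if __name__ == "__main__":
    # Two sampled responses for one image; the rewards are placeholders, not data.
    responses = [
        "The skin texture looks unnaturally smooth.\n\nVerdict: fake.",
        "Lighting on the face is consistent.\n\nEdges around the hair blur.\n\nVerdict: real.",
    ]
    placeholder_rewards = [[1.0, 0.0], [1.0, 1.0, 0.0]]
    # One reward per paragraph of each response.
    assert [len(split_paragraphs(r)) for r in responses] == [len(r) for r in placeholder_rewards]
    print(paragraph_advantages(placeholder_rewards))
```

Pooling across the whole group is only one plausible reading of "paragraph-level"; normalizing per paragraph position or per response would be equally reasonable assumptions. In any of these variants, each paragraph's advantage would then be applied to that paragraph's tokens in the policy-gradient loss, in place of GRPO's single per-response value.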

🛡️ Threat Analysis

Output Integrity Attack

Deepfake detection is explicitly an ML09 concern (AI-generated content detection / output integrity). The paper proposes a novel forensic detection method, PRPO. Rather than merely applying an existing detector to a new domain, it introduces a new RL-based training paradigm that grounds multimodal reasoning in visual evidence for synthetic media detection.

Details

Domains

vision, multimodal, nlp

Model Types

vlm, llm, transformer

Threat Tags

inference_time, digital

Applications

2025 · 1 citation
