Explainable Deepfake Detection with RL Enhanced Self-Blended Images
Ning Jiang 1, Dingheng Zeng 2, Yanhong Liu 2, Haiyang Yi 2, Shijie Yu 2, Minghe Weng 2, Haifeng Shen 2, Ying Li 1
Published on arXiv
2601.15624
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Achieves detection performance competitive with SOTA across multiple cross-dataset benchmarks without labor-intensive manual annotation, while providing explainable Chain-of-Thought forgery reasoning.
RLSBI (RL-enhanced Self-Blended Images)
Novel technique introduced
Most prior deepfake detection methods lack explainable outputs. With the growing interest in multimodal large language models (MLLMs), researchers have started exploring their use in interpretable deepfake detection. However, a major obstacle in applying MLLMs to this task is the scarcity of high-quality datasets with detailed forgery attribution annotations, as textual annotation is both costly and challenging, particularly for high-fidelity forged images or videos. Moreover, multiple studies have shown that reinforcement learning (RL) can substantially enhance performance in visual tasks, especially in improving cross-domain generalization. To facilitate the adoption of mainstream MLLM frameworks in deepfake detection with reduced annotation cost, and to investigate the potential of RL in this context, we propose an automated Chain-of-Thought (CoT) data generation framework based on Self-Blended Images, along with an RL-enhanced deepfake detection framework. Extensive experiments validate the effectiveness of our CoT data construction pipeline, tailored reward mechanism, and feedback-driven synthetic data generation approach. Our method achieves performance competitive with state-of-the-art (SOTA) approaches across multiple cross-dataset benchmarks. Implementation details are available at https://github.com/deon1219/rlsbi.
Key Contributions
- Automated Chain-of-Thought data generation framework using Self-Blended Images that extracts forgery conditions and maps them to forgery regions without manual annotation
- First application of R1-style GRPO reinforcement learning to face forgery detection, with a keyword-driven reward mechanism to address reward sparsity in binary classification
- Adaptive feedback mechanism integrated with GRPO training that dynamically adjusts synthetic data generation based on historical reward values
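To make the second contribution concrete, here is a minimal sketch of what a keyword-driven reward for GRPO-style training could look like. This is an illustrative assumption, not the paper's published implementation: the region vocabulary (`REGION_KEYWORDS`), the `answer:` response convention, and the 0.5 keyword weight are all hypothetical. The idea it demonstrates is the one stated above: a sparse binary correct/incorrect signal is densified by additionally rewarding CoT mentions of the forgery regions that the Self-Blended Images pipeline actually manipulated.

```python
import re

# Hypothetical vocabulary of facial regions an SBI pipeline could tag
# on each synthetic sample (the blended regions serve as ground truth).
REGION_KEYWORDS = {"eyes", "nose", "mouth", "jawline", "forehead", "cheeks"}

def keyword_reward(response: str, label: str, gt_regions: set) -> float:
    """Score one rollout: correctness of the fake/real verdict plus
    overlap between CoT-mentioned regions and ground-truth blend regions."""
    text = response.lower()

    # 1) Verdict reward: plain binary classification correctness,
    #    read from the text after a final "answer:" marker (assumed format).
    verdict = "fake" if "fake" in text.split("answer:")[-1] else "real"
    acc = 1.0 if verdict == label else 0.0

    # 2) Keyword reward: fraction of ground-truth forgery regions that
    #    the model's chain of thought actually names.
    mentioned = {kw for kw in REGION_KEYWORDS if re.search(rf"\b{kw}\b", text)}
    overlap = len(mentioned & gt_regions) / len(gt_regions) if gt_regions else 0.0

    # Weighted sum; the 0.5 coefficient is an arbitrary illustrative choice.
    return acc + 0.5 * overlap
```

In a GRPO loop, this scalar would be computed per rollout in a group and the group-normalized rewards used as advantages; the keyword term gives partial credit even when the binary verdict alone would yield identical rewards across the group.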
🛡️ Threat Analysis
The primary contribution is a novel deepfake detection framework for identifying AI-generated face forgeries, which directly addresses AI-generated content detection and output integrity. The paper introduces a new detection architecture (MLLM + GRPO RL), an automated CoT data generation pipeline, and reward mechanisms designed specifically for deepfake forensics.