Defense · 2026

MARE: Multimodal Alignment and Reinforcement for Explainable Deepfake Detection via Vision-Language Models

Wenbo Xu, Wei Lu, Xiangyang Luo, Jiantao Zhou

0 citations · 37 references · arXiv


Published on arXiv: 2601.20433

Output Integrity Attack

OWASP ML Top 10: ML09

Key Finding

MARE achieves state-of-the-art deepfake detection accuracy and reliability by augmenting vision-language models with RLHF-based reward training and a forgery disentanglement module, outperforming existing LLM-based detection methods.

MARE

Novel technique introduced


Abstract

Deepfake detection is a widely researched topic that is crucial for combating the spread of malicious content, with existing methods mainly modeling the problem as classification or spatial localization. The rapid advancement of generative models imposes new demands on Deepfake detection. In this paper, we propose multimodal alignment and reinforcement for explainable Deepfake detection via vision-language models, termed MARE, which aims to enhance the accuracy and reliability of Vision-Language Models (VLMs) in Deepfake detection and reasoning. Specifically, MARE designs comprehensive reward functions, incorporating reinforcement learning from human feedback (RLHF), to incentivize the generation of text-spatially aligned reasoning content that adheres to human preferences. In addition, MARE introduces a forgery disentanglement module to capture intrinsic forgery traces from high-level facial semantics, thereby improving its authenticity detection capability. We conduct thorough evaluations of the reasoning content generated by MARE. Both quantitative and qualitative experimental results demonstrate that MARE achieves state-of-the-art performance in terms of accuracy and reliability.
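The abstract describes the reward design only at this level of detail. As a rough, hypothetical sketch of how such "comprehensive reward functions" are often realized in RLHF-style fine-tuning, the snippet below combines a format reward (structured reasoning tags), an accuracy reward (real/fake verdict), and a spatial-alignment reward (IoU between a predicted forgery box and an annotated one) into a single scalar. All function names, weights, and the response format are assumptions, not the authors' implementation.

```python
import re

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def composite_reward(response, gt_label, gt_box, w=(0.2, 0.5, 0.3)):
    """Hypothetical composite reward: format + accuracy + spatial alignment."""
    # Format reward: response must follow a <think>...</think><answer>...</answer> layout.
    fmt_ok = bool(re.search(r"<think>.*</think>.*<answer>.*</answer>", response, re.S))
    r_format = 1.0 if fmt_ok else 0.0

    # Accuracy reward: the predicted real/fake label must match the ground truth.
    m = re.search(r"<answer>\s*(real|fake)", response, re.I | re.S)
    r_acc = 1.0 if (m and m.group(1).lower() == gt_label) else 0.0

    # Spatial reward (fake samples only): IoU between the predicted forgery
    # box cited in the reasoning and the annotated box.
    b = re.search(r"\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]", response)
    pred_box = tuple(float(v) for v in b.groups()) if b else None
    r_spatial = iou(pred_box, gt_box) if (pred_box and gt_box) else 0.0

    return w[0] * r_format + w[1] * r_acc + w[2] * r_spatial
```

Weighting accuracy above the format and spatial terms is one common choice; the actual balance used in MARE is not specified here.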


Key Contributions

  • MARE framework combining RLHF-based reward functions with multimodal alignment to enhance VLM accuracy and explainability for deepfake detection
  • Forgery disentanglement module that captures intrinsic forgery traces from high-level facial semantics (see the sketch after this list)
  • Text-spatially aligned reasoning generation that produces both textual analysis and spatial localization of forgery evidence
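The forgery disentanglement module is described only as capturing intrinsic forgery traces from high-level facial semantics. One common way to realize such disentanglement is to project encoder features into separate "content" and "forgery" subspaces, classify authenticity from the forgery branch alone, and penalize correlation between the two. The PyTorch module below is a minimal sketch under that assumption; the layer names, dimensions, and decorrelation loss are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class ForgeryDisentangler(nn.Module):
    """Minimal sketch: split facial features into content and forgery subspaces.

    Only the forgery branch feeds the real/fake head, so authenticity cues
    are (ideally) separated from identity/content information.
    """
    def __init__(self, feat_dim=768, latent_dim=256):
        super().__init__()
        self.to_content = nn.Linear(feat_dim, latent_dim)  # identity/content cues
        self.to_forgery = nn.Linear(feat_dim, latent_dim)  # forgery traces
        self.classifier = nn.Linear(latent_dim, 2)         # real vs. fake

    def forward(self, face_feats):
        z_content = self.to_content(face_feats)
        z_forgery = self.to_forgery(face_feats)
        logits = self.classifier(z_forgery)
        return logits, z_content, z_forgery

def decorrelation_loss(z_content, z_forgery):
    """One of several possible disentanglement penalties: discourage the two
    subspaces from encoding the same information via their cross-covariance."""
    zc = z_content - z_content.mean(dim=0)
    zf = z_forgery - z_forgery.mean(dim=0)
    cov = (zc.T @ zf) / max(zc.shape[0] - 1, 1)
    return cov.pow(2).mean()
```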

🛡️ Threat Analysis

Output Integrity Attack

Proposes a novel AI-generated content detection architecture for deepfake face images, pairing VLMs with RLHF-based reward functions and a forgery disentanglement module. As a detection/forensics technique, it addresses output integrity and content authenticity.


Details

Domains
vision, multimodal
Model Types
vlm, transformer
Threat Tags
inference_time
Datasets
FaceForensics++, Celeb-DF
Applications
deepfake detection, facial forgery detection, explainable AI forensics