Defense · 2026

MARE: Multimodal Alignment and Reinforcement for Explainable Deepfake Detection via Vision-Language Models

Wenbo Xu, Wei Lu, Xiangyang Luo, Jiantao Zhou

0 citations · 37 references · arXiv


Published on arXiv: 2601.20433

Output Integrity Attack

OWASP ML Top 10: ML09

Key Finding

MARE achieves state-of-the-art deepfake detection accuracy and reliability by augmenting vision-language models with RLHF-based reward training and a forgery disentanglement module, outperforming existing LLM-based detection methods.

MARE

Novel technique introduced


Abstract

Deepfake detection is a widely researched topic that is crucial for combating the spread of malicious content, with existing methods mainly modeling the problem as classification or spatial localization. The rapid advancement of generative models imposes new demands on Deepfake detection. In this paper, we propose multimodal alignment and reinforcement for explainable Deepfake detection via vision-language models, termed MARE, which aims to enhance the accuracy and reliability of Vision-Language Models (VLMs) in Deepfake detection and reasoning. Specifically, MARE designs comprehensive reward functions, incorporating reinforcement learning from human feedback (RLHF), to incentivize the generation of text-spatially aligned reasoning content that adheres to human preferences. In addition, MARE introduces a forgery disentanglement module to capture intrinsic forgery traces from high-level facial semantics, thereby improving its authenticity detection capability. We conduct thorough evaluations of the reasoning content generated by MARE. Both quantitative and qualitative experimental results demonstrate that MARE achieves state-of-the-art performance in terms of accuracy and reliability.
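The abstract describes the reward design only at this level of detail. As a rough, hypothetical sketch of how such "comprehensive reward functions" are often realized in RLHF-style fine-tuning, the snippet below combines a format reward (structured reasoning tags), an accuracy reward (real/fake verdict), and a spatial-alignment reward (IoU between a predicted forgery box and an annotated one) into a single scalar. All function names, weights, and the response format are assumptions, not the authors' implementation.

```python
import re

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def composite_reward(response, gt_label, gt_box, w=(0.2, 0.5, 0.3)):
    """Hypothetical composite reward: format + accuracy + spatial alignment."""
    # Format reward: response must follow a <think>...</think><answer>...</answer> layout.
    fmt_ok = bool(re.search(r"<think>.*</think>.*<answer>.*</answer>", response, re.S))
    r_format = 1.0 if fmt_ok else 0.0

    # Accuracy reward: the predicted real/fake label must match the ground truth.
    m = re.search(r"<answer>\s*(real|fake)", response, re.I | re.S)
    r_acc = 1.0 if (m and m.group(1).lower() == gt_label) else 0.0

    # Spatial reward (fake samples only): IoU between the predicted forgery
    # box cited in the reasoning and the annotated box.
    b = re.search(r"\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]", response)
    pred_box = tuple(float(v) for v in b.groups()) if b else None
    r_spatial = iou(pred_box, gt_box) if (pred_box and gt_box) else 0.0

    return w[0] * r_format + w[1] * r_acc + w[2] * r_spatial
```

Weighting accuracy above the format and spatial terms is one common choice; the actual balance used in MARE is not specified here.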


Key Contributions

  • MARE framework combining RLHF-based reward functions with multimodal alignment to enhance VLM accuracy and explainability for deepfake detection
  • Forgery disentanglement module that captures intrinsic forgery traces from high-level facial semantics (see the sketch after this list)
  • Text-spatially aligned reasoning generation that produces both textual analysis and spatial localization of forgery evidence
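The forgery disentanglement module is described only as capturing intrinsic forgery traces from high-level facial semantics. One common way to realize such disentanglement is to project encoder features into separate "content" and "forgery" subspaces, classify authenticity from the forgery branch alone, and penalize correlation between the two. The PyTorch module below is a minimal sketch under that assumption; the layer names, dimensions, and decorrelation loss are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class ForgeryDisentangler(nn.Module):
    """Minimal sketch: split facial features into content and forgery subspaces.

    Only the forgery branch feeds the real/fake head, so authenticity cues
    are (ideally) separated from identity/content information.
    """
    def __init__(self, feat_dim=768, latent_dim=256):
        super().__init__()
        self.to_content = nn.Linear(feat_dim, latent_dim)  # identity/content cues
        self.to_forgery = nn.Linear(feat_dim, latent_dim)  # forgery traces
        self.classifier = nn.Linear(latent_dim, 2)         # real vs. fake

    def forward(self, face_feats):
        z_content = self.to_content(face_feats)
        z_forgery = self.to_forgery(face_feats)
        logits = self.classifier(z_forgery)
        return logits, z_content, z_forgery

def decorrelation_loss(z_content, z_forgery):
    """One of several possible disentanglement penalties: discourage the two
    subspaces from encoding the same information via their cross-covariance."""
    zc = z_content - z_content.mean(dim=0)
    zf = z_forgery - z_forgery.mean(dim=0)
    cov = (zc.T @ zf) / max(zc.shape[0] - 1, 1)
    return cov.pow(2).mean()
```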

🛡️ Threat Analysis

Output Integrity Attack

Proposes a novel AI-generated content detection architecture for deepfake face images, pairing VLMs with RLHF-based reward functions and a forgery disentanglement module. As a detection/forensics technique, it addresses output integrity and content authenticity.


Details

Domains
vision, multimodal
Model Types
vlm, transformer
Threat Tags
inference_time
Datasets
FaceForensics++, Celeb-DF
Applications
deepfake detection, facial forgery detection, explainable AI forensics