EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning
Haoran Sun 1, Chen Cai 2, Huiping Zhuang 3, Kong Aik Lee 1, Lap-Pui Chau 1, Yi Wang 1
Published on arXiv: 2510.16442
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
EDVD-LLaMA achieves superior detection accuracy and generalization over traditional DVD methods across cross-forgery and cross-dataset scenarios while providing traceable reasoning explanations.
Novel technique introduced: EDVD-LLaMA
The rapid development of deepfake video technology has not only facilitated artistic creation but also made it easier to spread misinformation. Traditional deepfake video detection (DVD) methods suffer from opaque decision principles and generalize poorly to evolving forgery techniques. This highlights an urgent need for detectors that can identify forged content and provide verifiable reasoning explanations. This paper proposes the explainable deepfake video detection (EDVD) task and designs EDVD-LLaMA, a multimodal large language model (MLLM) reasoning framework that provides traceable reasoning processes alongside accurate detection results and trustworthy explanations. Our approach first incorporates Spatio-Temporal Subtle Information Tokenization (ST-SIT) to extract and fuse global and local cross-frame deepfake features, providing rich spatio-temporal semantic information as input for MLLM reasoning. Second, we construct a Fine-grained Multimodal Chain-of-Thought (Fg-MCoT) mechanism, which introduces facial feature data as hard constraints during the reasoning process to achieve pixel-level spatio-temporal video localization, suppress hallucinated outputs, and enhance the reliability of the chain of thought. In addition, we build an Explainable Reasoning FF++ dataset (ER-FF++set), leveraging structured data to annotate videos and ensure quality control, thereby supporting dual supervision for reasoning and detection. Extensive experiments demonstrate that EDVD-LLaMA achieves outstanding performance and robustness in terms of detection accuracy, explainability, and its ability to handle cross-forgery methods and cross-dataset scenarios. Compared to previous DVD methods, it provides a more explainable and superior solution. The project page is available at: https://11ouo1.github.io/edvd-llama/.
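The abstract describes facial feature data acting as hard constraints on the chain of thought, but it does not give the mechanism. One way to picture the idea is a filter that keeps a reasoning step only if the region it cites matches measured facial data for that frame. The sketch below is an illustration under that assumption, not the paper's Fg-MCoT implementation; `ReasoningStep`, `inside`, `filter_steps`, and the box format are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class ReasoningStep:
    """One step of a chain of thought that points at a face region (assumed schema)."""
    frame: int
    region: str
    box: Box
    claim: str


def inside(inner: Box, outer: Box) -> bool:
    """Return True if `inner` lies entirely within `outer`."""
    return (inner[0] >= outer[0] and inner[1] >= outer[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def filter_steps(steps: List[ReasoningStep],
                 face_regions: Dict[Tuple[int, str], Box]) -> List[ReasoningStep]:
    """Keep only steps whose cited box falls inside a measured region.

    `face_regions` maps (frame index, region name) to a box obtained from
    some facial-feature extractor (assumed to exist upstream).
    """
    kept = []
    for step in steps:
        measured = face_regions.get((step.frame, step.region))
        if measured is not None and inside(step.box, measured):
            kept.append(step)
    return kept


# Toy data: one measured region, one grounded step, one ungrounded step.
face_regions = {(3, "mouth"): (40, 80, 90, 110)}
steps = [
    ReasoningStep(3, "mouth", (45, 85, 80, 105), "lip boundary flickers"),
    ReasoningStep(3, "left_ear", (0, 0, 10, 10), "ear shape changes"),
]
kept = filter_steps(steps, face_regions)
```

In this toy run only the mouth step survives, because no measurement exists for the ear region in frame 3. Rejecting ungrounded steps outright is one reading of "hard constraint"; a softer variant could down-weight them instead.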
Key Contributions
- Spatio-Temporal Subtle Information Tokenization (ST-SIT) module that extracts and fuses global and local cross-frame deepfake features for MLLM reasoning
- Fine-grained Multimodal Chain-of-Thought (Fg-MCoT) mechanism using facial feature data as hard constraints to achieve pixel-level spatio-temporal localization and suppress hallucinations
- Explainable Reasoning FF++ dataset (ER-FF++set) with structured annotations supporting dual supervision for both detection and reasoning tasks
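To make the first contribution more concrete, the following minimal sketch shows one simple way global and local cross-frame signals could be fused into tokens. It is not the paper's ST-SIT module: the use of consecutive-frame differences, the concatenation step, and the names `frame_deltas` and `fuse_tokens` are assumptions chosen for illustration, and the feature vectors are toy values.

```python
from typing import List

Vector = List[float]


def frame_deltas(frames: List[Vector]) -> List[Vector]:
    """Differences between consecutive frames: a crude cross-frame signal."""
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(frames, frames[1:])]


def fuse_tokens(global_feats: List[Vector],
                local_feats: List[Vector]) -> List[Vector]:
    """Concatenate global and local deltas into one token per frame pair."""
    if len(global_feats) != len(local_feats):
        raise ValueError("expected one global and one local vector per frame")
    global_d = frame_deltas(global_feats)
    local_d = frame_deltas(local_feats)
    return [g + l for g, l in zip(global_d, local_d)]


# Toy features for three frames: 2-d whole-frame vectors, 1-d face-region vectors.
global_feats = [[0.0, 1.0], [0.5, 1.0], [0.5, 2.0]]
local_feats = [[1.0], [1.0], [3.0]]
tokens = fuse_tokens(global_feats, local_feats)
```

Three frames yield two tokens, each holding the global delta followed by the local delta. A learned fusion (for example attention over both streams) would replace the plain concatenation in a real system.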
🛡️ Threat Analysis
The paper proposes a novel architecture for detecting AI-generated (deepfake) video content: a forensic detection method for output integrity and content authenticity. It introduces new modules (ST-SIT for spatio-temporal feature extraction, Fg-MCoT for pixel-level localization) and a new benchmark dataset (ER-FF++set), so it qualifies as a novel detection architecture rather than a mere application of existing methods.