
Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis

Miao Liu 1, Fangda Wei 1, Jing Wang 1, Xinyuan Qian 2


Published on arXiv

2604.12650

Threat Category: Output Integrity Attack (OWASP ML Top 10 — ML09)

Key Finding: MANet achieves significantly superior performance on ListenForge, whereas existing Speaking Deepfake Detection models perform poorly in listening scenarios.

Novel Technique Introduced: MANet


Existing deepfake detection research has primarily focused on scenarios where the manipulated subject is actively speaking, i.e., generating fabricated content by altering the speaker's appearance or voice. However, in realistic interaction settings, attackers often alternate between falsifying speaking and listening states to mislead their targets, thereby enhancing the realism and persuasiveness of the scenario. Although the detection of 'listening deepfakes' remains largely unexplored and is hindered by a scarcity of both datasets and methodologies, the relatively limited quality of synthesized listening reactions presents an excellent breakthrough opportunity for current deepfake detection efforts. In this paper, we present the task of Listening Deepfake Detection (LDD). We introduce ListenForge, the first dataset specifically designed for this task, constructed using five Listening Head Generation (LHG) methods. To address the distinctive characteristics of listening forgeries, we propose MANet, a Motion-aware and Audio-guided Network that captures subtle motion inconsistencies in listener videos while leveraging the speaker's audio semantics to guide cross-modal fusion. Extensive experiments demonstrate that existing Speaking Deepfake Detection (SDD) models perform poorly in listening scenarios. In contrast, MANet achieves significantly superior performance on ListenForge. Our work highlights the necessity of rethinking deepfake detection beyond the traditional speaking-centric paradigm and opens new directions for multimodal forgery analysis in interactive communication settings. The dataset and code are available at https://anonymous.4open.science/r/LDD-B4CB.


Key Contributions

  • First dataset (ListenForge) specifically for listening deepfake detection using 5 Listening Head Generation methods
  • MANet architecture combining motion-aware analysis with audio-guided cross-modal fusion
  • Demonstrates existing speaking-centric deepfake detectors fail on listening scenarios
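The summary only describes MANet's audio-guided cross-modal fusion at a high level. As an illustration of the general idea (not the paper's actual implementation), the sketch below shows one plausible mechanism: speaker-audio frames act as queries that attend over listener motion features via scaled dot-product cross-attention. All function names, shapes, and the NumPy formulation are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def audio_guided_fusion(motion_feats, audio_feats):
    """Toy cross-attention: speaker audio frames (queries) attend over
    listener motion features (keys/values), so audio semantics weight
    which motion cues contribute to the fused representation."""
    d = motion_feats.shape[-1]
    scores = audio_feats @ motion_feats.T / np.sqrt(d)  # (T_audio, T_video)
    attn = softmax(scores, axis=-1)                     # rows sum to 1
    return attn @ motion_feats                          # (T_audio, d)

# Toy example: 4 audio frames, 6 listener motion frames, 8-dim features.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))
motion = rng.normal(size=(6, 8))
fused = audio_guided_fusion(motion, audio)
print(fused.shape)  # (4, 8)
```

In a real detector, `motion_feats` would come from a motion encoder (e.g., frame differences or optical flow through a CNN) and `audio_feats` from a speech encoder, with learned query/key/value projections rather than raw features as here.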

🛡️ Threat Analysis

Output Integrity Attack

The paper addresses detection and authentication of AI-generated video content (synthesized listening reactions/heads), which falls under output integrity and AI-generated content detection. The task is to verify whether a listener video is authentic or synthetically generated.


Details

Domains
multimodal, vision, audio
Model Types
multimodal, cnn
Threat Tags
inference_time, digital
Datasets
ListenForge
Applications
video conferencing, interactive communication, deepfake detection