Attribution-Guided Multimodal Deepfake Detection via Cross-Modal Forensic Fingerprints

Audio-visual deepfakes have reached a level of realism that makes perceptual detection unreliable, threatening media integrity and biometric security. While multimodal detection has shown promise, most approaches frame the problem as binary classification and often latch onto dataset-specific artifacts rather than genuine generative traces. We argue that a detector incapable of identifying how a video was forged is likely learning the wrong signal. Unlike binary detection, attribution-guided learning imposes a stronger geometric constraint on the shared embedding space, forcing the model to encode generator-specific forensic content rather than shortcuts. We propose the Attribution-Guided Multimodal Deepfake Detection (AMDD) framework, which jointly learns to detect and attribute manipulation. AMDD treats generator attribution as a structured regularizer that constrains representation geometry toward forensically meaningful features. We introduce a Cross-Modal Forensic Fingerprint Consistency (CMFFC) loss to enforce alignment between generator-induced artifacts in the visual and audio streams. This exploits the fact that coherent manipulation leaves correlated traces across modalities, grounded in the physical coupling between speech and facial articulation that synthetic pipelines routinely disrupt. Architecturally, we pair a ResNet50 with temporal attention for visual encoding with a pretrained ResNet18 operating on mel spectrograms for audio encoding, closing the encoder capacity gap found in prior models. On FakeAVCeleb, AMDD achieves 99.7% balanced accuracy and 99.8% AUC, with 95.9% attribution accuracy. Cross-dataset evaluation on DeepfakeTIMIT, DFDM, and LAV-DF confirms that real-video detection generalizes robustly, while fake detection on unseen generators remains an open challenge that we analyze in depth.
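The headline numbers above are reported as balanced accuracy and AUC rather than plain accuracy. As a minimal sketch of why that matters on a class-imbalanced deepfake benchmark, the snippet below computes both metrics from their standard definitions. The function names, the label convention (0 = real, 1 = fake), and the toy data are illustrative assumptions and are not taken from the paper's evaluation code.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (assumed: 0 = real, 1 = fake)."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        if not idx:
            continue  # skip a class that is absent from the evaluation set
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    if not recalls:
        raise ValueError("y_true must contain at least one label")
    return sum(recalls) / len(recalls)


def roc_auc(y_true, scores):
    """AUC as the probability that a random fake outscores a random real.

    Ties between a fake score and a real score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one real and one fake sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, made-up data: 2 real and 8 fake clips; every real clip is classified
# correctly, but only half of the fake clips are caught.
y_true = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain)                              # 0.6
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

Because balanced accuracy averages the recall of each class, perfect real-video recall cannot hide weak recall on fakes, and the reverse also holds. This is the asymmetry the cross-dataset discussion points to.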

Key Contributions

Attribution-guided detection framework (AMDD) that uses generator attribution as structured regularization to learn forensic fingerprints rather than dataset shortcuts
Cross-Modal Forensic Fingerprint Consistency (CMFFC) loss that enforces alignment between generator-induced artifacts across visual and audio streams
Architectural design pairing ResNet50 with temporal attention for video and ResNet18 for mel spectrograms to close encoder capacity gaps
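The contributions above combine three training signals: real/fake detection, generator attribution, and cross-modal fingerprint consistency. The sketch below shows one plausible way such terms could be combined for a single sample. The summary does not give the CMFFC formula, so the cosine-distance consistency term, the weights `lambda_attr` and `lambda_cons`, and all function and argument names are assumptions made for illustration, not the authors' formulation.

```python
import math


def softmax_cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against an integer class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon guards against zero vectors


def joint_loss(det_logits, attr_logits, vis_fp, aud_fp,
               is_fake, generator_id,
               lambda_attr=1.0, lambda_cons=0.5):
    """Hypothetical per-sample objective: detection + attribution + consistency.

    vis_fp and aud_fp stand for the visual and audio fingerprint embeddings.
    The consistency term is a cosine distance used here as a stand-in for
    CMFFC, whose exact form is not stated in this summary.
    """
    l_det = softmax_cross_entropy(det_logits, is_fake)
    l_attr = softmax_cross_entropy(attr_logits, generator_id)
    l_cons = 1.0 - cosine_similarity(vis_fp, aud_fp)
    return l_det + lambda_attr * l_attr + lambda_cons * l_cons


# Same (uninformative) logits, but aligned vs. orthogonal fingerprints.
aligned = joint_loss([0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0], [1.0, 0.0], 1, 2)
orthogonal = joint_loss([0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0], [0.0, 1.0], 1, 2)
print(aligned < orthogonal)  # True
```

In this sketch the attribution term plays the role of the regularizer: it penalizes embeddings that separate real from fake without also separating generators, while the consistency term penalizes visual and audio fingerprints that disagree.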

🛡️ Threat Analysis

Output Integrity Attack

The primary contribution is detecting and attributing AI-generated audio-visual content (deepfakes). The paper develops forensic techniques to verify content authenticity and to trace which generator produced a piece of synthetic media. This is output integrity and content provenance, the core of ML09.

Details

Domains

multimodal, vision, audio

Model Types

multimodal, cnn, gan

Threat Tags

inference_time

Datasets

FakeAVCeleb, DeepfakeTIMIT, DFDM, LAV-DF

Applications

2025 · 0 citations
