defense 2026

Attribution-Guided Multimodal Deepfake Detection via Cross-Modal Forensic Fingerprints

Wasim Ahmad, Wei Zhang, Xuerui Mao

Published on arXiv

2604.26453

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves 99.7% balanced accuracy and 99.8% AUC on FakeAVCeleb with 95.9% attribution accuracy; shows that attribution-guided learning produces generator-specific embedding clusters, whereas binary-only training leaves fake samples in unstructured clusters

Novel technique introduced

AMDD


Audio-visual deepfakes have reached a level of realism that makes perceptual detection unreliable, threatening media integrity and biometric security. While multimodal detection has shown promise, most approaches treat detection as binary classification and often latch onto dataset-specific artifacts rather than genuine generative traces. We argue that a detector incapable of identifying how a video was forged is likely learning the wrong signal. Unlike binary detection, attribution-guided learning imposes a stronger geometric constraint on the shared embedding space, forcing the model to encode generator-specific forensic content rather than shortcuts. We propose the Attribution-Guided Multimodal Deepfake Detection (AMDD) framework, which jointly learns to detect and attribute manipulation. AMDD treats generator attribution as a structured regularization that constrains representation geometry toward forensically meaningful features. We introduce a Cross-Modal Forensic Fingerprint Consistency (CMFFC) loss to enforce alignment between generator-induced artifacts in visual and audio streams. This exploits the fact that coherent manipulation leaves correlated traces across modalities, grounded in the physical coupling between speech and facial articulation that synthetic pipelines routinely disrupt. Architecturally, we pair a ResNet50 with temporal attention for visual encoding with a pretrained ResNet18 for mel spectrograms, closing the encoder capacity gap found in prior models. On FakeAVCeleb, AMDD achieves 99.7% balanced accuracy and 99.8% AUC with 95.9% attribution accuracy. Cross-dataset evaluation on DeepfakeTIMIT, DFDM, and LAV-DF confirms that real video detection generalizes robustly, while fake detection on unseen generators remains an open challenge that we analyze in depth.
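
The abstract describes joint detection and attribution but does not give the objective itself. The sketch below shows one common way such a multi-task loss could be written: a binary real/fake term plus a weighted generator-classification term. The function names, the `lambda_attr` weight, and the convention that real videos get their own attribution class are all assumptions made for illustration, not details taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Multi-class cross-entropy for one sample (target is a class index)."""
    return -math.log(softmax(logits)[target])


def binary_cross_entropy(logit, label):
    """Stable sigmoid cross-entropy for one sample (label is 0.0 or 1.0)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def joint_loss(det_logit, is_fake, attr_logits, generator_id, lambda_attr=1.0):
    """Detection loss plus weighted attribution loss.

    Hypothetical formulation: lambda_attr and the class layout of
    attr_logits (for example, index 0 meaning "real") are assumptions.
    """
    detection = binary_cross_entropy(det_logit, float(is_fake))
    attribution = cross_entropy(attr_logits, generator_id)
    return detection + lambda_attr * attribution


if __name__ == "__main__":
    # One fake sample, three attribution classes, true generator is class 2.
    print(joint_loss(det_logit=2.0, is_fake=1,
                     attr_logits=[0.1, -0.3, 1.5], generator_id=2))
```

Setting `lambda_attr=0.0` reduces this to binary-only training, which is the baseline the abstract argues against; raising it makes the attribution term act more strongly as a regularizer on the shared representation.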


Key Contributions

  • Attribution-guided detection framework (AMDD) that uses generator attribution as structured regularization to learn forensic fingerprints rather than dataset shortcuts
  • Cross-Modal Forensic Fingerprint Consistency (CMFFC) loss that enforces alignment between generator-induced artifacts across visual and audio streams
  • Architectural design pairing ResNet50 with temporal attention for video and ResNet18 for mel spectrograms to close encoder capacity gaps
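
The exact form of the CMFFC loss is not stated in this summary. As a rough illustration of what "enforcing alignment" between visual and audio fingerprints could look like, the sketch below uses a mean cosine distance between paired embeddings. The cosine form, the function names, and the assumption that both encoders already project into a space of the same dimension are hypothetical choices, not the authors' definition.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Guard against division by zero for all-zero embeddings.
    return dot / max(norm_u * norm_v, 1e-12)


def consistency_loss(visual_batch, audio_batch):
    """Mean cosine distance between paired visual and audio embeddings.

    Assumed stand-in for a cross-modal consistency term: 0 when every
    pair points in the same direction, 2 when every pair is opposite.
    """
    if not visual_batch or len(visual_batch) != len(audio_batch):
        raise ValueError("batches must be non-empty and the same length")
    distances = [1.0 - cosine_similarity(v, a)
                 for v, a in zip(visual_batch, audio_batch)]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    visual = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
    audio = [[0.8, 0.2, 0.1], [0.1, 0.9, 0.0]]
    print(consistency_loss(visual, audio))
```

A term like this would typically be added to the detection and attribution losses with its own weight, so that embeddings of the same clip from the two modalities are pulled toward a shared generator-specific direction.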

🛡️ Threat Analysis

Output Integrity Attack

The primary contribution is detecting and attributing AI-generated audio-visual content (deepfakes). The paper develops forensic techniques to verify content authenticity and trace which generator produced synthetic media. This falls under output integrity and content provenance, the core of ML09.


Details

Domains
multimodal, vision, audio
Model Types
multimodal, cnn, gan
Threat Tags
inference_time
Datasets
FakeAVCeleb, DeepfakeTIMIT, DFDM, LAV-DF
Applications
deepfake detection, media forensics, content authentication, biometric security