defense arXiv Mar 26, 2026
Sahibzada Adil Shahzad, Ammarah Hashmi, Junichi Yamagishi et al. · National Institute of Informatics · Academia Sinica +2 more
Self-supervised multimodal deepfake detector trained on real videos, detecting visual tampering artifacts and audio-visual lip-sync inconsistencies
Output Integrity Attack · multimodal · vision · audio
Multimodal deepfakes can exhibit subtle visual artifacts and cross-modal inconsistencies, which remain challenging to detect, especially when detectors are trained primarily on curated synthetic forgeries. Such dependence on synthetic data can introduce dataset and generator bias, limiting scalability and robustness to unseen manipulations. We propose SAVe, a self-supervised audio-visual deepfake detection framework that learns entirely from authentic videos. SAVe generates on-the-fly, identity-preserving, region-aware self-blended pseudo-manipulations to emulate tampering artifacts, enabling the model to learn complementary visual cues across multiple facial granularities. To capture cross-modal evidence, SAVe also models lip-speech synchronization via an audio-visual alignment component that detects the temporal misalignment patterns characteristic of audio-visual forgeries. Experiments on FakeAVCeleb and AV-LipSync-TIMIT demonstrate competitive in-domain performance and strong cross-dataset generalization, highlighting self-supervised learning as a scalable paradigm for multimodal deepfake detection.
multimodal · cnn · transformer · National Institute of Informatics · Academia Sinica · National Chengchi University +1 more
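The self-blending idea above can be illustrated with a minimal sketch: blend a lightly perturbed copy of a facial region back into the original real frame, so the only "forgery" cue is a subtle boundary and color mismatch, with no generator involved. This is a generic illustration under assumed conventions (frames as float arrays in [0, 1], a precomputed region mask); the function name and perturbation ranges are hypothetical and not taken from the SAVe paper.

```python
import numpy as np

def self_blend(frame, region_mask, rng=None):
    """Create a pseudo-manipulated frame by alpha-blending a lightly
    perturbed copy of a facial region back into the real frame.

    frame: HxWx3 float array in [0, 1] (a real video frame)
    region_mask: HxW float array in [0, 1] selecting a facial region
    (illustrative sketch; ranges below are arbitrary assumptions)
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Lightly perturb brightness/offset of the source copy so the
    # blended region differs subtly from its surroundings.
    source = np.clip(frame * rng.uniform(0.9, 1.1)
                     + rng.uniform(-0.05, 0.05), 0.0, 1.0)
    # Random blend strength keeps the artifact subtle and varied.
    soft_mask = (region_mask * rng.uniform(0.7, 1.0))[..., None]
    # Alpha-blend: artifacts appear only inside/along the region,
    # mimicking face-swap blending seams without any generator.
    return frame * (1.0 - soft_mask) + source * soft_mask
```

Varying the mask over different facial regions (mouth, eyes, whole face) is one way such augmentation could expose a detector to cues at multiple facial granularities, as the abstract describes.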
benchmark arXiv Jan 7, 2025
Sung-Feng Huang, Heng-Cheng Kuo, Zhehuai Chen et al. · NVIDIA · National Taiwan University +1 more
Benchmark dataset (SINE) for seamless AI speech edit detection, revealing gaps in cut-and-paste-trained detectors
Output Integrity Attack · audio
Advances in neural speech editing have raised concerns about misuse in spoofing attacks. Traditional partially edited speech corpora focus primarily on cut-and-paste edits, which, while maintaining speaker consistency, often introduce detectable discontinuities. Recent methods, like A³T and Voicebox, improve transitions by leveraging contextual information. To foster spoofing detection research, we introduce the Speech INfilling Edit (SINE) dataset, created with Voicebox. We detail the process of re-implementing Voicebox training and of creating the dataset. Subjective evaluations confirm that speech edited with this infilling technique is harder for listeners to detect than conventional cut-and-paste edits. Despite this difficulty for humans, experimental results demonstrate that self-supervised-based detectors achieve remarkable performance in detection, localization, and generalization across different edit methods. The dataset and related models will be made publicly available.
transformer · diffusion · NVIDIA · National Taiwan University · Academia Sinica
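The contrast the abstract draws between cut-and-paste edits and infilling can be made concrete with a toy sketch: a hard splice leaves an abrupt waveform jump at the junction that far exceeds the signal's normal sample-to-sample variation, which is exactly the discontinuity cue older detectors keyed on and which infilling models like Voicebox avoid by regenerating the region from context. The signals and function names below are illustrative, not part of the SINE pipeline.

```python
import numpy as np

def splice(target, insert, start):
    """Hard cut-and-paste: drop `insert` into `target` at sample `start`."""
    return np.concatenate([target[:start], insert, target[start:]])

def junction_jump(signal, idx):
    """First-difference magnitude at a junction; spliced audio tends
    to show a far larger jump here than the local average."""
    return abs(float(signal[idx] - signal[idx - 1]))

# Toy example: splice a phase-shifted sinusoid into a 200 Hz tone
# sampled at 16 kHz (arbitrary illustrative values).
t = np.arange(1600) / 16000.0
target = np.sin(2 * np.pi * 200 * t)
insert = np.sin(2 * np.pi * 200 * t[:400] + np.pi / 2)  # phase-shifted
edited = splice(target, insert, 800)

# The phase mismatch produces an abrupt jump at the splice point,
# far above the waveform's typical sample-to-sample difference.
jump = junction_jump(edited, 800)
baseline = np.mean(np.abs(np.diff(target)))
```

An infilling edit would instead synthesize the inserted span conditioned on the surrounding audio, removing this junction cue; that is why the abstract reports such edits are harder for humans to spot, shifting the burden to learned detectors.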