
Towards Generalizable Deepfake Detection via Forgery-aware Audio-Visual Adaptation: A Variational Bayesian Approach

Fan Nie 1,2, Jiangqun Ni 1,2, Jian Zhang 3, Bin Zhang 2, Weizhe Zhang 4,5, Bin Li 6

1 citation · 60 references · arXiv


Published on arXiv · 2511.19080

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

FoVB outperforms state-of-the-art methods across multiple benchmarks for generalizable audio-visual deepfake detection.

FoVB (Forgery-aware Audio-Visual Adaptation with Variational Bayes)

Novel technique introduced


The widespread application of AIGC content has brought not only unprecedented opportunities but also potential security concerns, e.g., audio-visual deepfakes. It is therefore important to develop an effective and generalizable method for multi-modal deepfake detection. Typically, audio-visual correlation learning can expose subtle cross-modal inconsistencies, e.g., audio-visual misalignment, which serve as crucial clues for deepfake detection. In this paper, we reformulate correlation learning with variational Bayesian estimation, where the audio-visual correlation is approximated as a Gaussian-distributed latent variable, and thus develop a novel framework for deepfake detection, i.e., Forgery-aware Audio-Visual Adaptation with Variational Bayes (FoVB). Specifically, given the prior knowledge of pre-trained backbones, we adopt two core designs to estimate audio-visual correlations effectively. First, we exploit various difference convolutions and a high-pass filter to discern local and global forgery traces in both modalities. Second, with the extracted forgery-aware features, we estimate the latent Gaussian variable of audio-visual correlation via variational Bayes. We then factorize the variable into modality-specific and correlation-specific components under an orthogonality constraint, allowing them to learn intra-modal and cross-modal forgery traces with less entanglement. Extensive experiments demonstrate that FoVB outperforms other state-of-the-art methods on various benchmarks.
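The difference-convolution and high-pass-filtering idea from the abstract can be sketched numerically. The following is a minimal NumPy illustration, not the paper's exact operators: the central-difference form and the Laplacian kernel are hypothetical stand-ins chosen to show why such filters suppress smooth content and surface high-frequency residue.

```python
import numpy as np

def central_diff_conv2d(x, w):
    """Central difference convolution: each neighbor is weighted by its
    deviation from the center pixel, emphasizing local gradient cues
    where forgery traces tend to hide (illustrative sketch only)."""
    k = w.shape[0]          # assume a square, odd-sized kernel
    pad = k // 2
    xp = np.pad(x, pad, mode="edge")
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            patch = xp[i:i + k, j:j + k]
            # weight the deviation of each neighbor from the center pixel
            out[i, j] = np.sum(w * (patch - x[i, j]))
    return out

# A Laplacian kernel sums to zero, so the difference form coincides with
# plain convolution and acts as a simple high-pass filter (an
# illustrative stand-in for the paper's high-pass filter).
LAPLACIAN = np.array([[0, -1, 0],
                      [-1, 4, -1],
                      [0, -1, 0]], dtype=float)

smooth = np.full((5, 5), 3.0)  # constant image: no edges, no residue
print(np.allclose(central_diff_conv2d(smooth, LAPLACIAN), 0.0))  # True
```

On a constant image the response vanishes, while any edge or splicing boundary produces a nonzero response; this is the sense in which such operators "discern" forgery traces that vanilla convolutions average away.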


Key Contributions

  • Reformulates audio-visual correlation learning via variational Bayesian estimation, modeling cross-modal correlations as a Gaussian distributed latent variable for more generalizable forgery representation.
  • Introduces Global-Local Forgery-aware Adaptation (GLFA) using difference convolutions and high-pass filters to extract local and global forgery traces from both audio and visual modalities.
  • Proposes Variational Bayesian Forgery Estimation (VBFE) that factorizes the latent variable into modality-specific and correlation-specific components with orthogonality constraints, reducing feature entanglement.
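The variational pieces above can be sketched in a few lines. This is a generic NumPy illustration of the standard machinery (reparameterization trick, closed-form KL to a standard normal prior, and a cosine-based orthogonality penalty), with function names and the exact penalty form chosen hypothetically, not taken from the paper.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I): the standard
    reparameterization trick, keeping the sampling step differentiable."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over
    latent dimensions, as used in variational Bayesian objectives."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def orthogonality_penalty(z_modal, z_corr):
    """Squared cosine similarity between the modality-specific and
    correlation-specific factors; driving it to zero reduces their
    entanglement (a hypothetical instantiation of the constraint)."""
    num = np.dot(z_modal, z_corr)
    den = np.linalg.norm(z_modal) * np.linalg.norm(z_corr) + 1e-8
    return (num / den) ** 2

rng = np.random.default_rng(0)
mu, logvar = np.zeros(8), np.zeros(8)     # posterior matching the prior
z = reparameterize(mu, logvar, rng)       # one latent sample
print(kl_to_standard_normal(mu, logvar))  # 0.0: KL vanishes at the prior
```

In a full model, encoders would predict `mu` and `logvar` from the forgery-aware features, and the KL term plus the orthogonality penalty would be added to the detection loss.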

🛡️ Threat Analysis

Output Integrity Attack

Proposes a novel multi-modal deepfake detection architecture targeting AI-generated audio-visual content. The paper's primary contribution is a detection system for AIGC deepfakes, falling squarely under AI-generated content detection within Output Integrity.


Details

Domains
multimodal · vision · audio · generative
Model Types
transformer · cnn · multimodal
Threat Tags
inference_time · digital
Applications
audio-visual deepfake detection · multi-modal forensics