defense 2026

Leveraging large multimodal models for audio-video deepfake detection: a pilot study

Songjun Cao 1,2, Yuqi Li 1,2, Yunpeng Luo 1, Jianjun Yin 2, Long Ma 1

0 citations


Published on arXiv

2602.23393

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

AV-LMMDetect achieves 85.09% accuracy on the Mavos-DD open-set scenario, compared to 32.26% for the untuned Qwen 2.5 Omni base model, setting a new state of the art.

AV-LMMDetect

Novel technique introduced


Audio-visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task-specific models: they work well on curated tests but scale poorly and generalize weakly across domains. We introduce AV-LMMDetect, a supervised fine-tuned (SFT) large multimodal model that casts AVD as a prompted yes/no classification with the prompt "Is this video real or fake?". Built on Qwen 2.5 Omni, it jointly analyzes the audio and visual streams and is trained in two stages: lightweight LoRA alignment, followed by full fine-tuning of the audio-visual encoders. On FakeAVCeleb and Mavos-DD, AV-LMMDetect matches or surpasses prior methods, and it sets a new state of the art on Mavos-DD.
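To make the prompted-classification framing concrete, here is a minimal sketch of how a free-text answer from such a model could be mapped to a label and scored. The prompt string is the one quoted above; the helper names `parse_verdict` and `accuracy`, and the rule for ambiguous answers, are illustrative assumptions and not the authors' evaluation code.

```python
import re

# Prompt wording as quoted in the abstract.
PROMPT = "Is this video real or fake?"


def parse_verdict(answer: str) -> str:
    """Map a free-text model answer to 'real', 'fake', or 'unknown'.

    Hypothetical rule: the answer counts only if it mentions exactly one
    of the two labels as a whole word.
    """
    words = re.findall(r"[a-z]+", answer.lower())
    has_real = "real" in words
    has_fake = "fake" in words
    if has_real == has_fake:  # neither label, or both at once
        return "unknown"
    return "real" if has_real else "fake"


def accuracy(answers, labels):
    """Fraction of answers whose parsed verdict equals the ground-truth label."""
    if not labels:
        raise ValueError("labels must not be empty")
    correct = sum(parse_verdict(a) == y for a, y in zip(answers, labels))
    return correct / len(labels)
```

Under this rule an unparseable answer is counted as an error. That is one plausible convention, not necessarily the one behind the reported numbers.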


Key Contributions

  • AV-LMMDetect: the first SFT large multimodal model, built on Qwen 2.5 Omni, to reformulate audio-visual deepfake detection as prompted yes/no classification
  • Two-stage training strategy: lightweight LoRA alignment with frozen encoders, followed by full fine-tuning of the audio-visual encoders for cross-modal synergy
  • State-of-the-art performance on Mavos-DD and competitive results on FakeAVCeleb, with 85.09% accuracy vs. 32.26% for the base model on the open-set full scenario
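
The two-stage schedule in the contributions above can be sketched as a rule for which parameters are trainable in each stage. This is a toy illustration in plain Python: the `Param` class, the parameter names, and the `audio_encoder.` / `vision_encoder.` / `.lora_` naming scheme are all hypothetical. The summary also does not say whether the LoRA adapters keep training in stage 2, so freezing them there is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """Stand-in for a model parameter (hypothetical; not a real framework type)."""
    name: str
    trainable: bool = False


def apply_stage(params, stage):
    """Set trainable flags for the given stage and return the trainable names.

    Stage 1: only LoRA adapter weights train; the encoders stay frozen.
    Stage 2: audio and vision encoder weights train in full.
             (Assumption: adapters are frozen in this stage.)
    """
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    for p in params:
        is_lora = ".lora_" in p.name
        is_encoder = p.name.startswith(("audio_encoder.", "vision_encoder."))
        if stage == 1:
            p.trainable = is_lora
        else:
            p.trainable = is_encoder and not is_lora
    return [p.name for p in params if p.trainable]


# Hypothetical parameter names for illustration only.
params = [
    Param("llm.layers.0.q_proj.weight"),
    Param("llm.layers.0.q_proj.lora_A"),
    Param("audio_encoder.conv1.weight"),
    Param("vision_encoder.patch_embed.weight"),
]
stage1_names = apply_stage(params, 1)
stage2_names = apply_stage(params, 2)
```

In a real framework the same idea would be expressed by toggling each parameter's gradient flag before building the optimizer for that stage.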

🛡️ Threat Analysis

Output Integrity Attack

Proposes a novel AI-generated content detection architecture (AV-LMMDetect) for detecting audio-visual deepfakes, directly addressing output integrity by authenticating whether media content is real or synthetically generated.


Details

Domains
multimodal, audio, vision
Model Types
vlm, multimodal, transformer
Threat Tags
inference_time
Datasets
FakeAVCeleb, Mavos-DD
Applications
audio-visual deepfake detection, media forensics