Leveraging large multimodal models for audio-video deepfake detection: a pilot study
Songjun Cao 1,2, Yuqi Li 1,2, Yunpeng Luo 1, Jianjun Yin 2, Long Ma 1
Published on arXiv
2602.23393
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
AV-LMMDetect achieves 85.09% accuracy on the Mavos-DD open-set scenario, compared to 32.26% for the untuned Qwen 2.5 Omni base model, setting a new state of the art.
AV-LMMDetect
Novel technique introduced
Audio-visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task-specific models: they work well on curated tests but scale poorly and generalize weakly across domains. We introduce AV-LMMDetect, a supervised fine-tuned (SFT) large multimodal model that casts AVD as a prompted yes/no classification: "Is this video real or fake?". Built on Qwen 2.5 Omni, it jointly analyzes audio and visual streams for deepfake detection and is trained in two stages: lightweight LoRA alignment followed by full fine-tuning of the audio-visual encoders. On FakeAVCeleb and Mavos-DD, AV-LMMDetect matches or surpasses prior methods and sets a new state of the art on the Mavos-DD dataset.
Key Contributions
- AV-LMMDetect: the first SFT large multimodal model, built on Qwen 2.5 Omni, to reformulate audio-visual deepfake detection as prompted yes/no classification
- Two-stage training strategy combining lightweight LoRA alignment (frozen encoders) followed by full audio-visual encoder fine-tuning for cross-modal synergy
- State-of-the-art performance on Mavos-DD and competitive results on FakeAVCeleb, with 85.09% accuracy vs. 32.26% for the base model on the open-set full scenario
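The prompted yes/no reformulation above can be sketched in a few lines: the detector wraps both modalities and a fixed question into one chat-style request, then maps the model's free-form answer onto a binary label. The message schema, helper names, and parsing rule below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of prompted yes/no deepfake classification.
# All names here are assumptions for illustration, not AV-LMMDetect's code.

PROMPT = "Is this video real or fake?"

def build_messages(video_path: str, audio_path: str) -> list[dict]:
    """Assemble a chat-style request pairing both modalities with the prompt."""
    return [{
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},  # visual stream
            {"type": "audio", "path": audio_path},  # audio stream
            {"type": "text", "text": PROMPT},       # fixed yes/no question
        ],
    }]

def parse_label(response: str) -> str:
    """Map the model's free-form answer onto a binary label."""
    text = response.strip().lower()
    if "fake" in text:
        return "fake"
    if "real" in text:
        return "real"
    return "unknown"  # abstain when the answer names neither class

msgs = build_messages("clip.mp4", "clip.wav")
print(parse_label("The video appears to be fake."))  # -> fake
```

Framing detection as a prompt keeps the full multimodal backbone in the loop, so generalization comes from the large model rather than a narrow task head; only the answer parser is task-specific.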
🛡️ Threat Analysis
Proposes a novel AI-generated content detection architecture (AV-LMMDetect) for detecting audio-visual deepfakes, directly addressing output integrity by authenticating whether media content is real or synthetically generated.