Leveraging large multimodal models for audio-video deepfake detection: a pilot study
Songjun Cao 1,2, Yuqi Li 1,2, Yunpeng Luo 1, Jianjun Yin 2, Long Ma 1
Published on arXiv
2602.23393
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
AV-LMMDetect achieves 85.09% accuracy on the Mavos-DD open-set scenario, compared to 32.26% for the untuned Qwen 2.5 Omni base model, setting a new state of the art.
AV-LMMDetect
Novel technique introduced
Audio-visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task-specific models: they work well on curated tests but scale poorly and generalize weakly across domains. We introduce AV-LMMDetect, a supervised fine-tuned (SFT) large multimodal model that casts AVD as a prompted yes/no classification: "Is this video real or fake?". Built on Qwen 2.5 Omni, it jointly analyzes audio and visual streams for deepfake detection and is trained in two stages: lightweight LoRA alignment followed by full fine-tuning of the audio-visual encoders. On FakeAVCeleb and Mavos-DD, AV-LMMDetect matches or surpasses prior methods and sets a new state of the art on the Mavos-DD dataset.
Key Contributions
- AV-LMMDetect: the first SFT large multimodal model, built on Qwen 2.5 Omni, to reformulate audio-visual deepfake detection as prompted yes/no classification
- Two-stage training strategy combining lightweight LoRA alignment (frozen encoders) followed by full audio-visual encoder fine-tuning for cross-modal synergy
- State-of-the-art performance on Mavos-DD and competitive results on FakeAVCeleb, with 85.09% accuracy vs. 32.26% for the base model on the open-set full scenario
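The prompted yes/no reformulation above can be sketched in a few lines: the detector wraps both modalities and a fixed question into one chat-style request, then maps the model's free-form answer onto a binary label. The message schema, helper names, and parsing rule below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of prompted yes/no deepfake classification.
# All names here are assumptions for illustration, not AV-LMMDetect's code.

PROMPT = "Is this video real or fake?"

def build_messages(video_path: str, audio_path: str) -> list[dict]:
    """Assemble a chat-style request pairing both modalities with the prompt."""
    return [{
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},  # visual stream
            {"type": "audio", "path": audio_path},  # audio stream
            {"type": "text", "text": PROMPT},       # fixed yes/no question
        ],
    }]

def parse_label(response: str) -> str:
    """Map the model's free-form answer onto a binary label."""
    text = response.strip().lower()
    if "fake" in text:
        return "fake"
    if "real" in text:
        return "real"
    return "unknown"  # abstain when the answer names neither class

msgs = build_messages("clip.mp4", "clip.wav")
print(parse_label("The video appears to be fake."))  # -> fake
```

Framing detection as a prompt keeps the full multimodal backbone in the loop, so generalization comes from the large model rather than a narrow task head; only the answer parser is task-specific.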
🛡️ Threat Analysis
Proposes a novel AI-generated content detection architecture (AV-LMMDetect) for detecting audio-visual deepfakes, directly addressing output integrity by authenticating whether media content is real or synthetically generated.