Pindrop it! Audio and Visual Deepfake Countermeasures for Robust Detection and Fine-Grained Localization
Nicholas Klein, Hemlata Tak, James Fullwood, Krishna Regmi, Leonidas Spinoulas, Ganesh Sivaraman, Tianxiang Chen, Elie Khoury
Published on arXiv (2508.08141)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Achieves the best performance in temporal localization and a top-4 ranking in classification on the TestA evaluation split of the ACM 1M Deepfakes Detection Challenge.
Pindrop Audio-Visual Deepfake Countermeasure
Novel technique introduced
The field of visual and audio generation is advancing rapidly, with new state-of-the-art methods appearing continually. This rapid proliferation of techniques underscores the need for robust solutions for detecting synthetic content in videos. Detection becomes especially challenging when fine-grained, localized manipulations are applied in the visual domain, the audio domain, or both, since such subtle modifications are difficult for detection algorithms to catch. This paper presents solutions to the problems of deepfake video classification and temporal localization. The methods were submitted to the ACM 1M Deepfakes Detection Challenge, achieving the best performance in the temporal localization task and a top-four ranking in the classification task on the TestA split of the evaluation dataset.
Key Contributions
- Cross-architecture ensemble approach for robust deepfake video classification across audio and visual domains
- Fine-grained temporal localization of deepfake manipulations (identifying which segments are synthetic)
- Audio-visual fusion countermeasure achieving #1 in temporal localization and top-4 in classification on the ACM 1M Deepfakes Detection Challenge TestA split
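The summary above does not give the paper's actual fusion or localization algorithm, but the general pattern behind such systems can be sketched: several per-frame detectors (audio and visual) each emit a deepfake score per frame, the scores are fused (here by a simple weighted average, an illustrative assumption), and contiguous runs of frames above a threshold are reported as manipulated time segments. All function names, the threshold, and the fusion rule below are hypothetical, not the authors' method.

```python
import numpy as np

def fuse_scores(score_lists, weights=None):
    """Fuse per-frame deepfake scores from several models.

    Illustrative late-fusion by weighted average; the paper's actual
    cross-architecture ensemble may differ.
    """
    scores = np.stack([np.asarray(s, dtype=float) for s in score_lists])  # (n_models, n_frames)
    if weights is None:
        weights = np.full(len(score_lists), 1.0 / len(score_lists))
    return np.average(scores, axis=0, weights=weights)

def scores_to_segments(frame_scores, threshold=0.5, fps=25.0):
    """Convert frame-level scores into (start_s, end_s) manipulated segments.

    Frames scoring at or above `threshold` are flagged as fake; adjacent
    flagged frames are merged into one temporal segment, in seconds.
    """
    fake = np.asarray(frame_scores) >= threshold
    segments, start = [], None
    for i, flag in enumerate(fake):
        if flag and start is None:
            start = i                         # segment opens at this frame
        elif not flag and start is not None:
            segments.append((start / fps, i / fps))
            start = None                      # segment closes before this frame
    if start is not None:                     # segment runs to end of clip
        segments.append((start / fps, len(fake) / fps))
    return segments
```

For example, fusing a visual stream `[0.0, 1.0, 1.0, 0.0]` with an audio stream `[0.2, 0.8, 0.6, 0.0]` and thresholding at 0.5 localizes a single manipulated segment covering the middle two frames.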
🛡️ Threat Analysis
Directly addresses detection of AI-generated synthetic content (deepfakes) across audio and visual modalities, including fine-grained localization of manipulated segments — core output integrity and content authenticity work.