benchmark 2026

Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR

Ajinkya Kulkarni 1,2,3, Sandipana Dowerah 2, Atharva Kulkarni 3, Tanel Alumäe 2, Mathew Magimai Doss 1

0 citations

Published on arXiv

2603.06164

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

The compact, iteratively pre-trained multilingual mHuBERT (~100M parameters) matches or outperforms systems 5–20× larger, as well as commercial detectors, on cross-domain benchmarks. WavLM variants, by contrast, become overconfident and miscalibrated under test-time augmentation (TTA) perturbation, a failure that EER alone cannot detect.

RAPTOR

Novel technique introduced


Self-supervised learning (SSL) underpins modern audio deepfake detection, yet most prior work centers on a single large wav2vec2-XLSR backbone, leaving compact backbones understudied. We present RAPTOR (Representation Aware Pairwise-gated Transformer for Out-of-domain Recognition), a controlled study of compact SSL backbones from the HuBERT and WavLM families within a unified pairwise-gated fusion detector, evaluated across 14 cross-domain benchmarks. We show that multilingual HuBERT pre-training is the primary driver of cross-domain robustness, enabling 100M-parameter models to match larger and commercial systems. Beyond equal error rate (EER), we introduce a test-time augmentation (TTA) protocol with perturbation-based aleatoric uncertainty to expose calibration differences invisible to standard metrics: WavLM variants exhibit overconfident miscalibration under perturbation, whereas iterative mHuBERT remains stable. These findings indicate that SSL pre-training trajectory, not model scale, drives reliable audio deepfake detection.
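The test-time augmentation idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's protocol: the perturbations (additive Gaussian noise at a random SNR plus a random gain), the number of views, and the uncertainty summary (mean binary entropy and standard deviation of the per-view scores) are all assumptions chosen for simplicity, and `toy_score_fn` is a hypothetical stand-in for a real detector that returns the probability that a waveform is fake.

```python
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli distribution with parameter p."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def perturb(wave, rng, snr_db_range=(15.0, 30.0), gain_db_range=(-3.0, 3.0)):
    """Assumed perturbation: white noise at a random SNR, then a random gain."""
    snr_db = rng.uniform(*snr_db_range)
    gain_db = rng.uniform(*gain_db_range)
    signal_power = np.mean(wave ** 2) + 1e-12
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return (wave + noise) * (10.0 ** (gain_db / 20.0))

def tta_uncertainty(score_fn, wave, n_views=16, seed=0):
    """Score several perturbed views of one utterance and summarize the spread."""
    rng = np.random.default_rng(seed)
    probs = np.array([score_fn(perturb(wave, rng)) for _ in range(n_views)])
    return {
        "mean_prob": float(probs.mean()),                         # averaged prediction
        "expected_entropy": float(binary_entropy(probs).mean()),  # per-view uncertainty
        "prob_std": float(probs.std()),                           # sensitivity to perturbation
    }

def toy_score_fn(wave):
    """Hypothetical detector: a logistic function of log energy (illustration only)."""
    energy = np.mean(wave ** 2)
    return float(1.0 / (1.0 + np.exp(-(np.log(energy + 1e-12) + 2.0))))

# One second of a 220 Hz tone at 16 kHz as a placeholder input.
wave = np.sin(2 * np.pi * 220.0 * np.arange(16000) / 16000.0)
result = tta_uncertainty(toy_score_fn, wave)
print(result)
```

Under these assumptions, a detector whose `mean_prob` stays near 0 or 1 with low `expected_entropy` while its accuracy drops under perturbation would be the overconfident case the abstract describes.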


Key Contributions

  • RAPTOR: a pairwise-gated hierarchical layer-fusion detector used as a controlled evaluation framework across six compact SSL backbones
  • Controlled comparison of HuBERT and WavLM families (≈100M parameters) across 14 cross-domain benchmarks, showing multilingual iterative pre-training — not scale — is the primary driver of cross-domain robustness
  • TTA-based aleatoric uncertainty protocol that exposes overconfident miscalibration in WavLM variants invisible to standard EER metrics
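One reason EER can miss miscalibration is that it depends only on how scores are ranked, so any strictly increasing transformation of the scores leaves it unchanged. The sketch below illustrates this on synthetic data. It assumes label 1 means fake and higher scores mean more likely fake; the temperature-based `sharpen` function is a hypothetical stand-in for an overconfident model, and the Brier score is used here only as a simple calibration-sensitive metric, not as the paper's measure.

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal error rate; higher score = more likely fake (label 1)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.unique(scores)          # sorted candidate thresholds
    fake = scores[labels == 1]
    real = scores[labels == 0]
    # Miss rate: fakes scored below the threshold.
    miss = np.array([(fake < t).mean() for t in thresholds])
    # False-alarm rate: bona fide samples scored at or above the threshold.
    false_alarm = np.array([(real >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(miss - false_alarm))
    return float((miss[idx] + false_alarm[idx]) / 2.0)

def brier_score(probs, labels):
    """Mean squared error between predicted probability and the 0/1 label."""
    return float(np.mean((np.asarray(probs) - np.asarray(labels)) ** 2))

def sharpen(probs, temperature=0.25):
    """Strictly increasing map that pushes probabilities toward 0 or 1."""
    probs = np.clip(probs, 1e-6, 1.0 - 1e-6)
    logits = np.log(probs / (1.0 - probs))
    return 1.0 / (1.0 + np.exp(-logits / temperature))

# Synthetic, calibrated-by-construction scores: each label is drawn with
# probability equal to its score.
rng = np.random.default_rng(0)
calibrated = rng.uniform(0.02, 0.98, size=2000)
labels = (rng.uniform(size=2000) < calibrated).astype(int)
overconfident = sharpen(calibrated)

eer_calibrated = compute_eer(calibrated, labels)
eer_sharpened = compute_eer(overconfident, labels)
brier_calibrated = brier_score(calibrated, labels)
brier_sharpened = brier_score(overconfident, labels)

print("EER:", eer_calibrated, eer_sharpened)
print("Brier:", brier_calibrated, brier_sharpened)
```

Because the ranking is preserved, the two EER values coincide, while the Brier score worsens for the sharpened scores. This is the kind of gap a calibration-aware protocol is meant to expose.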

🛡️ Threat Analysis

Output Integrity Attack

The paper's central contribution is detecting AI-generated audio (deepfakes). RAPTOR is a novel detection architecture, and the TTA uncertainty protocol is a new evaluation methodology for AI-generated content detection; both fall squarely under ML09 (output integrity / AI content authenticity).


Details

Domains
audio
Model Types
transformer
Threat Tags
inference_time
Datasets
ASVspoof 2019, Speech DF Arena
Applications
audio deepfake detection, synthetic speech detection