A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection

Hashim Ali , Nithin Sai Adupa , Surya Subramani , Hafiz Malik

Published on arXiv (2603.01482)

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Large-scale discriminative SSL models (XLS-R, UniSpeech-SAT, WavLM Large) consistently outperform generative and spectrogram-based models, and remain resilient under acoustic degradations where generative models degrade sharply.

Spoof-SUPERB

Novel technique introduced


Self-supervised learning (SSL) has transformed speech processing, with benchmarks such as SUPERB establishing fair comparisons across diverse downstream tasks. Despite its security-critical importance, audio deepfake detection has remained outside these efforts. In this work, we introduce Spoof-SUPERB, a benchmark for audio deepfake detection that systematically evaluates 20 SSL models spanning generative, discriminative, and spectrogram-based architectures. We evaluate these models on multiple in-domain and out-of-domain datasets. Our results reveal that large-scale discriminative models such as XLS-R, UniSpeech-SAT, and WavLM Large consistently outperform other models, benefiting from multilingual pretraining, speaker-aware objectives, and model scale. We further analyze the robustness of these models under acoustic degradations, showing that generative approaches degrade sharply while discriminative models remain resilient. This benchmark establishes a reproducible baseline and provides practical insights into which SSL representations are most reliable for securing speech systems against audio deepfakes.
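The robustness analysis evaluates models under acoustic degradations such as additive noise. The paper does not publish its exact corruption code, but a common way to apply noise at a controlled signal-to-noise ratio is sketched below (the function name and interface are illustrative, not from the paper):

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at a target signal-to-noise ratio (in dB)."""
    # Tile or truncate the noise so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Sweeping `snr_db` (e.g. 20 dB down to 0 dB) produces progressively harder test conditions of the kind under which the discriminative models are reported to stay resilient.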


Key Contributions

  • First SUPERB-style leaderboard evaluating 20 SSL models under a unified protocol (frozen front-ends, weighted layer sum, FC backend) for audio deepfake detection
  • Systematic analysis across six in-domain and out-of-domain spoofing datasets, identifying discriminative large-scale models (XLS-R, UniSpeech-SAT, WavLM Large) as consistently best
  • First structured comparison of SSL robustness under acoustic degradations (codecs, reverberation, noise), revealing sharp degradation of generative approaches vs. resilience of discriminative models
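The unified protocol (frozen front-end, weighted layer sum, FC backend) follows the SUPERB recipe: the SSL encoder's layer outputs are combined with learnable softmax weights and fed to a small trainable classifier. A minimal numpy sketch of the downstream head, assuming the layer weights and the linear layer's parameters are learned elsewhere:

```python
import numpy as np

def weighted_layer_sum(hidden_states: np.ndarray, layer_logits: np.ndarray) -> np.ndarray:
    """Combine frozen SSL layer outputs with learned softmax weights.

    hidden_states: (num_layers, time, dim) activations from a frozen front-end.
    layer_logits:  (num_layers,) learnable scalars, one per layer.
    """
    w = np.exp(layer_logits - layer_logits.max())
    w /= w.sum()                                   # softmax over layers
    return np.tensordot(w, hidden_states, axes=1)  # (time, dim)

def fc_backend(features: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Mean-pool over time, then one linear layer -> class logits (bona fide vs spoof)."""
    pooled = features.mean(axis=0)  # (dim,)
    return pooled @ W + b           # (num_classes,)
```

Because only `layer_logits`, `W`, and `b` are trained, differences in detection performance are attributable to the frozen SSL representations themselves, which is the point of the benchmark design.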

🛡️ Threat Analysis

Output Integrity Attack

Audio deepfake detection is explicitly listed under ML09 (AI-generated content detection). This paper establishes a reproducible benchmark for evaluating which SSL representations best detect AI-synthesized speech, contributing directly to output integrity and content authenticity.


Details

Domains
audio
Model Types
transformer
Datasets
ASVspoof 2019, ASVspoof 2021 LA, ASVspoof 2021 DF, DeepfakeEval 2024, In-the-Wild, Famous Figures, ASVSpoofLD
Applications
audio deepfake detection, anti-spoofing, automatic speaker verification