defense 2026

Similarity Choice and Negative Scaling in Supervised Contrastive Learning for Deepfake Audio Detection

Jaskirat Sudan , Hashim Ali , Surya Subramani , Hafiz Malik

Published on arXiv

2604.26057

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Cosine SupCon with a delayed queue achieves 8.29% EER on in-the-wild (ITW) data and 4.44% pooled EER; angular similarity achieves 8.70% ITW EER without queued negatives.
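For readers unfamiliar with the metric, the equal error rate (EER) is the operating point where the false acceptance rate equals the false rejection rate. The sketch below is a minimal threshold sweep in NumPy, not the paper's evaluation script; the function name `equal_error_rate` and the label convention (1 = bona fide, 0 = spoof, higher score = more bona fide) are assumptions made for illustration.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER by sweeping a threshold over the observed scores.

    Assumed convention (not from the paper): labels are 1 for bona fide
    and 0 for spoof, and a higher score means "more likely bona fide".
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    bona = scores[labels == 1]
    spoof = scores[labels == 0]

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that "reject everything" is also considered.
    thresholds = np.unique(scores)
    thresholds = np.concatenate([thresholds, [thresholds[-1] + 1.0]])

    # False acceptance: spoof accepted (score >= t).
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # False rejection: bona fide rejected (score < t).
    frr = np.array([(bona < t).mean() for t in thresholds])

    # Take the threshold where the two error rates are closest.
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2.0)
```

Multiplying the returned fraction by 100 gives a percentage comparable in form to the figures quoted above.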

Angular Similarity SupCon

Novel technique introduced


Supervised contrastive learning (SupCon) is widely used to shape representations, but has seen limited targeted study for audio deepfake detection. Existing work typically combines contrastive terms with broader pipelines; however, the focus on SupCon itself is missing. In this work, we run a controlled study on wav2vec2 XLS-R (300M) that varies (i) similarity in SupCon (cosine vs angular similarity derived from the hyperspherical angle) and (ii) negative scaling using a warm-started global cross-batch queue. Stage 1 fine-tunes the encoder and projection head with SupCon; Stage 2 freezes them and trains a linear classifier with BCE. Trained on ASVspoof 2019 LA and evaluated on ASV19 eval plus ITW and ASVspoof 2021 DF/LA, Cosine SupCon with a delayed queue achieves the best ITW EER (8.29%) and pooled EER (4.44), while angular similarity performs strongly without queued negatives (ITW 8.70), indicating reduced reliance on large negative sets.
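To make the similarity comparison concrete, here is a minimal NumPy sketch of a SupCon-style loss with a switchable similarity. It is not the authors' implementation. The angular form `1 - 2θ/π` (with θ the angle between unit-normalized embeddings) is one common definition and is an assumption here, since the summary does not give the exact formula; the temperature of 0.1 and the names `pairwise_similarity` and `supcon_loss` are likewise illustrative.

```python
import numpy as np

def pairwise_similarity(z, kind="cosine"):
    """Pairwise similarity between rows of z after L2 normalization."""
    z = np.asarray(z, dtype=float)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    cos = np.clip(z @ z.T, -1.0, 1.0)
    if kind == "cosine":
        return cos
    if kind == "angular":
        # Assumed form: map the angle theta in [0, pi] linearly to [1, -1].
        return 1.0 - 2.0 * np.arccos(cos) / np.pi
    raise ValueError(f"unknown similarity kind: {kind!r}")

def supcon_loss(z, labels, kind="cosine", temperature=0.1):
    """Supervised contrastive loss averaged over anchors with >= 1 positive."""
    labels = np.asarray(labels)
    n = len(labels)
    sim = pairwise_similarity(z, kind) / temperature

    not_self = ~np.eye(n, dtype=bool)
    positives = (labels[:, None] == labels[None, :]) & not_self

    # Log of the softmax denominator over all non-self samples (stable form).
    masked = np.where(not_self, sim, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(masked - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom

    n_pos = positives.sum(axis=1)
    valid = n_pos > 0  # skip anchors that have no same-label partner
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float((-pos_log_prob[valid] / n_pos[valid]).mean())

# Toy usage: two tight clusters, one per class.
z = np.array([[1.0, 0.1], [1.0, -0.1], [-1.0, 0.1], [-1.0, -0.1]])
loss_cos = supcon_loss(z, [0, 0, 1, 1], kind="cosine")
loss_ang = supcon_loss(z, [0, 0, 1, 1], kind="angular")
```

In this sketch only the similarity matrix changes between the two variants; the rest of the loss is shared, which mirrors the controlled comparison described above.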


Key Contributions

  • Controlled study isolating the effect of similarity function (cosine vs angular) and negative scaling in supervised contrastive learning for deepfake audio detection
  • Demonstrates that angular similarity reduces reliance on large negative sets while maintaining strong cross-dataset performance
  • Achieves the best in-the-wild EER (8.29%) using cosine SupCon with a delayed queue, with training on ASVspoof 2019 LA
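The negative-scaling idea in the contributions above can be pictured as a fixed-size FIFO buffer of past embeddings that only becomes available after a delay. The sketch below is a hypothetical illustration, not the paper's queue: the class name `DelayedEmbeddingQueue`, the `delay_steps` parameter, and the rule that the queue is filled from the first step but withheld until the delay has passed are all assumptions, since the summary does not describe the warm-start mechanics.

```python
from collections import deque

import numpy as np

class DelayedEmbeddingQueue:
    """FIFO buffer of (embedding, label) pairs from previous batches.

    Assumed behaviour (hypothetical): entries are stored from the first
    step, but get() returns None until `delay_steps` batches were enqueued.
    """

    def __init__(self, max_size, delay_steps):
        self.items = deque(maxlen=max_size)  # oldest entries drop out first
        self.delay_steps = delay_steps
        self.steps = 0

    def enqueue(self, embeddings, labels):
        """Store one batch of embeddings with their class labels."""
        for emb, label in zip(np.asarray(embeddings, dtype=float), labels):
            self.items.append((emb.copy(), int(label)))
        self.steps += 1

    def get(self):
        """Return (embeddings, labels) arrays, or None while still delayed."""
        if self.steps < self.delay_steps or not self.items:
            return None
        embeddings = np.stack([emb for emb, _ in self.items])
        labels = np.array([label for _, label in self.items])
        return embeddings, labels

# Toy usage: capacity of 3 entries, available from the second batch on.
queue = DelayedEmbeddingQueue(max_size=3, delay_steps=2)
queue.enqueue(np.ones((2, 4)), [0, 1])
first = queue.get()            # still inside the delay window
queue.enqueue(np.zeros((2, 4)), [1, 0])
second = queue.get()           # now returns the 3 most recent entries
```

Queued embeddings would then be concatenated with the current batch before computing the contrastive loss, so that each anchor is contrasted against more samples than a single batch provides.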

🛡️ Threat Analysis

Output Integrity Attack

Audio deepfake detection is a form of AI-generated content detection: it verifies whether audio is authentic or synthetic. The paper proposes a detector for AI-generated speech, which falls under output integrity and content authenticity (ML09).


Details

Domains
audio
Model Types
transformer
Threat Tags
inference_time
Datasets
ASVspoof 2019 LA, ASVspoof 2021 DF, ASVspoof 2021 LA, In-The-Wild (ITW)
Applications
audio deepfake detection, speech anti-spoofing