Defense · 2025

SONAR: Spectral-Contrastive Audio Residuals for Generalizable Deepfake Detection

Ido Nitzan Hidekel, Gal Lifshitz, Khen Cohen, Dan Raviv

0 citations · 57 references · arXiv


Published on arXiv · 2511.21325

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

SONAR achieves state-of-the-art performance on ASVspoof 2021 and in-the-wild benchmarks while converging four times faster than strong baselines.

SONAR

Novel technique introduced


Deepfake (DF) audio detectors still struggle to generalize to out-of-distribution inputs. A central reason is spectral bias, the tendency of neural networks to learn low-frequency structure before high-frequency (HF) details, which both causes DF generators to leave HF artifacts and leaves those same artifacts under-exploited by common detectors. To address this gap, we propose Spectral-cONtrastive Audio Residuals (SONAR), a frequency-guided framework that explicitly disentangles an audio signal into complementary representations. An XLSR encoder captures the dominant low-frequency content, while a cloned copy of the same path, preceded by learnable, value-constrained SRM high-pass filters, distills faint HF residuals. Frequency cross-attention reunites the two views to capture long- and short-range frequency dependencies, and a frequency-aware Jensen–Shannon contrastive loss pulls real content–noise pairs together while pushing fake embeddings apart, accelerating optimization and sharpening decision boundaries. Evaluated on the ASVspoof 2021 and In-the-Wild benchmarks, SONAR attains state-of-the-art performance and converges four times faster than strong baselines. By elevating faint high-frequency residuals to first-class learning signals, SONAR offers a fully data-driven, frequency-guided contrastive framework that splits the latent space into two disjoint manifolds: natural-HF for genuine audio and distorted-HF for synthetic audio, thereby sharpening decision boundaries. Because the scheme operates purely at the representation level, it is architecture-agnostic and, in future work, can be seamlessly integrated into any model or modality where subtle high-frequency cues are decisive.
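The noise branch relies on high-pass filters that block low-frequency content so only faint HF residuals survive. The paper's exact SRM parameterization is not given here, so the sketch below is a minimal, framework-free illustration of one plausible reading: learnable filter taps that are clipped to a value range and projected to sum to zero, which guarantees DC (pure low-frequency) rejection. The function names and the specific constraint are assumptions for illustration, not SONAR's published implementation.

```python
# Hedged sketch: a value-constrained high-pass FIR filter in the spirit
# of learnable SRM residual filters. Pure Python, no DL framework.

def constrain_highpass(taps, bound=1.0):
    """Clip taps to [-bound, bound], then shift them to sum to zero
    so the filter rejects DC (low-frequency) content."""
    clipped = [max(-bound, min(bound, t)) for t in taps]
    mean = sum(clipped) / len(clipped)
    return [t - mean for t in clipped]

def convolve(signal, taps):
    """'Valid' 1-D convolution: output length len(signal) - len(taps) + 1."""
    k = len(taps)
    return [sum(signal[i + j] * taps[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A constant (purely low-frequency) signal is suppressed to ~0,
# leaving only high-frequency residuals for the noise branch.
taps = constrain_highpass([0.3, -0.9, 0.3])   # -> [0.4, -0.8, 0.4]
residual = convolve([1.0] * 8, taps)           # all entries ~0
```

In a trained system the taps would be gradient-updated and the constraint re-applied after each step; the zero-sum projection is what keeps the learned filter high-pass regardless of where the weights drift.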


Key Contributions

  • Dual-branch SONAR architecture with a Content Feature Extractor (XLSR) and a Noise Feature Extractor (learnable SRM + high-pass filters + XLSR) that explicitly disentangles low- and high-frequency audio representations
  • Frequency cross-attention module that reunites the two branches to model long- and short-range dependencies between low-frequency content and high-frequency residuals
  • Frequency-aware Jensen–Shannon contrastive loss that pulls real content–noise pairs together while pushing fake embeddings apart, accelerating convergence 4× over baselines
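The contrastive objective above is built on the Jensen–Shannon divergence, which is symmetric and bounded by log 2, making it a stable pull/push signal. The sketch below shows the core mechanic under stated assumptions: a plain margin-style loss that minimizes JSD for real content–noise pairs and pushes fake pairs apart until a margin is met. The margin form and function names are illustrative; the paper's frequency-aware weighting is not reproduced here.

```python
# Hedged sketch of a JS-divergence contrastive loss over discrete
# probability vectors (e.g. softmax-normalized embeddings).
import math

def kl(p, q):
    """Kullback-Leibler divergence between discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by log 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def contrastive_js_loss(p, q, is_real_pair, margin=0.5):
    """Pull real content-noise pairs together (minimize JSD);
    push fake pairs apart until JSD reaches `margin`."""
    d = js_divergence(p, q)
    return d if is_real_pair else max(0.0, margin - d)

# Identical distributions: zero loss for a real pair,
# full margin penalty for a fake pair.
real_loss = contrastive_js_loss([0.5, 0.5], [0.5, 0.5], is_real_pair=True)
fake_loss = contrastive_js_loss([0.5, 0.5], [0.5, 0.5], is_real_pair=False)
```

Because JSD is bounded, the push term saturates rather than diverging, which is one plausible reason a loss of this shape could speed convergence relative to unbounded distance objectives.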

🛡️ Threat Analysis

Output Integrity Attack

Deepfake audio detection is an output integrity / content authenticity problem. SONAR verifies whether audio was AI-generated by exploiting the high-frequency spectral artifacts that generative models leave behind, making it a novel forensic detection architecture for AI-generated content.


Details

Domains
audio
Model Types
transformer
Threat Tags
inference_time
Datasets
ASVspoof 2021 · In-the-Wild
Applications
deepfake audio detection · audio forensics · synthetic speech detection