Generalizable Audio Spoofing Detection using Non-Semantic Representations
Arnab Das 1,2, Yassine El Kheir 1,2, Carlos Franzreb 1, Tim Herzig 1,3, Tim Polzehl 1,2, Sebastian Möller 1,3
Published on arXiv
2509.00186
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Significantly outperforms state-of-the-art approaches on out-of-domain (In the Wild) real-world data while achieving comparable in-domain ASVspoof performance
TrillFake
Novel technique introduced
Rapid advancements in generative modeling have made synthetic audio generation easy, leaving speech-based services vulnerable to spoofing attacks. Consequently, robust countermeasures are needed more than ever. Existing solutions for deepfake detection are often criticized for lacking generalizability, and they fail drastically when applied to real-world data. This study proposes a novel method for generalizable spoofing detection that leverages non-semantic universal audio representations. Extensive experiments were performed to find suitable non-semantic features using TRILL and TRILLsson models. The results indicate that the proposed method achieves comparable performance on the in-domain test set while significantly outperforming state-of-the-art approaches on out-of-domain test sets. Notably, it demonstrates superior generalization on public-domain data, surpassing methods based on hand-crafted features, semantic embeddings, and end-to-end architectures.
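The general idea of feeding fixed audio representations to a spoofing detector can be illustrated with a minimal sketch. This is not the authors' architecture: the summary does not describe the classifier, so mean pooling and logistic regression are assumptions chosen for brevity. No TRILL or TRILLsson model is loaded either; `fake_frame_embeddings` is a hypothetical stand-in that returns random frames, and the data is entirely synthetic.

```python
import numpy as np


def pool_embeddings(frames):
    """Mean-pool frame-level embeddings of shape (T, D) into one vector (D,)."""
    return np.asarray(frames, dtype=float).mean(axis=0)


def spoof_probability(X, w, b):
    """Sigmoid of a linear score; values near 1 mean 'likely spoofed'."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


def train_logreg(X, y, lr=0.5, epochs=300):
    """Fit logistic regression with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        err = spoof_probability(X, w, b) - y
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b


def fake_frame_embeddings(rng, is_spoof, frames=20, dim=8):
    """Hypothetical stand-in for a frozen embedding model (assumption).

    Returns random frames whose mean depends on the class, so the toy
    problem is separable. A real system would run a pretrained model here.
    """
    loc = 1.0 if is_spoof else 0.0
    return rng.normal(loc=loc, scale=1.0, size=(frames, dim))


rng = np.random.default_rng(0)
labels = np.array([0, 1] * 50)  # 0 = bona fide, 1 = spoof
X = np.stack(
    [pool_embeddings(fake_frame_embeddings(rng, bool(label))) for label in labels]
)
w, b = train_logreg(X, labels)
accuracy = np.mean((spoof_probability(X, w, b) >= 0.5) == labels)
print(f"training accuracy on synthetic data: {accuracy:.2f}")
```

Keeping the embedding model frozen and training only a small head is one common way to use universal representations; whether the paper fine-tunes the upstream model is not stated in this summary.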
Key Contributions
- Novel use of non-semantic universal audio representations (TRILL and TRILLsson) as features for audio spoofing/deepfake detection, motivated by the insight that discarding semantic content improves generalization
- Demonstrates superior out-of-domain generalization over SOTA methods based on hand-crafted features, semantic SSL embeddings (XLS-R, WavLM, HuBERT), and end-to-end architectures (RawNet2, AASIST)
- Cross-dataset evaluation on ASVspoof (in-domain) and the noisy, public-domain In the Wild dataset (out-of-domain), showing competitive in-domain performance alongside significantly improved real-world robustness
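The cross-dataset comparison described in the contributions above can be sketched as scoring one detector on two test sets and comparing error rates. The summary does not name the metric, so equal error rate (EER), which is commonly reported for ASVspoof, is an assumption here. The scores below are synthetic and are not results from the paper; the function and variable names are illustrative.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate EER for scores where higher means 'more likely spoof'.

    labels: 1 = spoof, 0 = bona fide. Both classes must be present.
    Sweeps every observed score as a threshold and returns the average of
    the two error rates at the threshold where they are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for threshold in np.unique(scores):
        flagged = scores >= threshold
        false_alarm = np.mean(flagged[labels == 0])  # bona fide flagged as spoof
        miss = np.mean(~flagged[labels == 1])        # spoof not flagged
        gap = abs(false_alarm - miss)
        if gap < best_gap:
            best_gap, eer = gap, (false_alarm + miss) / 2.0
    return eer


# Synthetic detector scores: well separated in-domain, overlapping out-of-domain.
rng = np.random.default_rng(0)
eval_labels = np.array([0] * 200 + [1] * 200)
in_domain_scores = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 1, 200)])
out_domain_scores = np.concatenate([rng.normal(0, 1, 200), rng.normal(1, 1, 200)])

eer_in = equal_error_rate(in_domain_scores, eval_labels)
eer_out = equal_error_rate(out_domain_scores, eval_labels)
print(f"in-domain EER: {eer_in:.3f}, out-of-domain EER: {eer_out:.3f}")
```

A large gap between the two numbers is the generalization failure the paper targets; a detector that generalizes well keeps the out-of-domain error close to the in-domain one.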
🛡️ Threat Analysis
Proposes a novel method for detecting AI-generated audio: a forensic approach that uses non-semantic representations to detect synthetic or spoofed speech with improved cross-dataset generalization. The work explicitly falls under deepfake detection and AI-generated content detection.