Multilingual Source Tracing of Speech Deepfakes: A First Benchmark
Xi Xuan, Yang Xiao, Rohan Kumar Das, Tomi Kinnunen
Published on arXiv (arXiv:2508.04143)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Language mismatch between training and inference significantly degrades speech deepfake source tracing performance, with SSL-based models showing better cross-lingual generalization than DSP-based approaches.
Recent progress in generative AI has made it increasingly easy to create natural-sounding deepfake speech from just a few seconds of audio. While these tools support helpful applications, they also raise serious concerns by making it possible to generate convincing fake speech in many languages. Current research has largely focused on detecting fake speech, but little attention has been given to tracing the source models used to generate it. This paper introduces the first benchmark for multilingual speech deepfake source tracing, covering both mono- and cross-lingual scenarios. We comparatively investigate DSP- and SSL-based modeling; examine how SSL representations fine-tuned on different languages impact cross-lingual generalization performance; and evaluate generalization to unseen languages and speakers. Our findings offer the first comprehensive insights into the challenges of identifying speech generation models when training and inference languages differ. The dataset, protocol and code are available at https://github.com/xuanxixi/Multilingual-Source-Tracing.
Key Contributions
- First benchmark dataset and evaluation protocol for multilingual speech deepfake source tracing, covering mono- and cross-lingual scenarios
- Comparative investigation of DSP-based vs. SSL-based models for source tracing, including fine-tuning on different languages and cross-lingual generalization
- Systematic analysis of language mismatch effects on source tracing, including generalization to unseen languages and unseen speakers
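The mono- vs. cross-lingual protocol above amounts to partitioning source-tracing trials by whether the training and inference languages match, then scoring each bucket separately. A minimal sketch of that bookkeeping, using hypothetical trial records and language codes (not the paper's actual data format):

```python
from collections import defaultdict

# Hypothetical trial records: (train_lang, test_lang, true_source, predicted_source).
# "Source" here means the generative model that produced the utterance.
trials = [
    ("en", "en", "A", "A"), ("en", "en", "B", "B"), ("en", "en", "C", "A"),
    ("en", "zh", "A", "B"), ("en", "zh", "B", "B"), ("en", "zh", "C", "A"),
]

def accuracy_by_condition(trials):
    """Split trials into mono-lingual (train language == test language) and
    cross-lingual (train != test) buckets and compute accuracy per bucket."""
    hits, totals = defaultdict(int), defaultdict(int)
    for train_lang, test_lang, truth, pred in trials:
        cond = "mono" if train_lang == test_lang else "cross"
        totals[cond] += 1
        hits[cond] += int(truth == pred)
    return {cond: hits[cond] / totals[cond] for cond in totals}

print(accuracy_by_condition(trials))  # mono: 2/3 correct, cross: 1/3 correct
```

A gap between the two buckets is exactly the language-mismatch degradation the paper measures; the same split can be refined per language pair or restricted to unseen languages and speakers.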
🛡️ Threat Analysis
Speech deepfake source tracing is a content provenance and attribution task — identifying which generative model produced AI-synthesized audio falls under output integrity and AI-generated content authentication, explicitly covered by ML09.