Reliable Audio Deepfake Detection in Variable Conditions via Quantum-Kernel SVMs

Detecting synthetic speech is challenging when labeled data are scarce and recording conditions vary. Existing end-to-end deep models often overfit or fail to generalize, and while kernel methods can remain competitive, their performance depends heavily on the chosen kernel. Here, we show that using a quantum kernel in audio deepfake detection reduces false-positive rates without increasing model size. Quantum feature maps embed data into high-dimensional Hilbert spaces, enabling the use of expressive similarity measures and compact classifiers. Building on this motivation, we compare quantum-kernel SVMs (QSVMs) with classical SVMs using identical mel-spectrogram preprocessing and stratified 5-fold cross-validation across four corpora (ASVspoof 2019 LA, ASVspoof 5 (2024), ADD23, and an In-the-Wild set). QSVMs achieve consistently lower equal-error rates (EER): 0.183 vs. 0.299 on ASVspoof 5 (2024), 0.081 vs. 0.188 on ADD23, 0.346 vs. 0.399 on ASVspoof 2019, and 0.355 vs. 0.413 on In-the-Wild. At the EER operating point (where FPR equals FNR), these correspond to absolute false-positive-rate reductions of 0.116 (38.8% relative), 0.107 (56.9%), 0.053 (13.3%), and 0.058 (14.0%), respectively. We also report the consistency of the results across cross-validation folds, along with margin-based measures of class separation, using identical settings for both models. The only modification is the kernel: the features and SVM remain unchanged, no additional trainable parameters are introduced, and the quantum kernel is computed on a conventional computer.
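As a rough illustration of what "a quantum kernel computed on a conventional computer" can mean, the sketch below simulates a fidelity kernel with plain statevectors. The feature map used here (one RY rotation per qubit, no entanglement) is an assumption chosen for brevity, not the map used in the paper, and the toy vectors stand in for reduced mel-spectrogram features. All function names are hypothetical.

```python
import math
from functools import reduce

def qubit_state(angle):
    """Amplitudes of RY(angle)|0> = cos(angle/2)|0> + sin(angle/2)|1>."""
    return [math.cos(angle / 2.0), math.sin(angle / 2.0)]

def kron(a, b):
    """Kronecker product of two real amplitude vectors."""
    return [ai * bj for ai in a for bj in b]

def feature_state(x):
    """Product state with one feature per qubit (2**len(x) amplitudes).

    Assumed feature map for illustration only.
    """
    return reduce(kron, (qubit_state(v) for v in x), [1.0])

def quantum_kernel(x, y):
    """Fidelity kernel |<phi(x)|phi(y)>|^2 from simulated statevectors."""
    overlap = sum(a * b for a, b in zip(feature_state(x), feature_state(y)))
    return overlap ** 2

def gram_matrix(rows, cols):
    """Kernel matrix between two lists of feature vectors."""
    return [[quantum_kernel(r, c) for c in cols] for r in rows]

# Toy 3-dimensional feature vectors (made-up values, not real audio features).
X = [[0.1, 0.5, 1.2], [0.2, 0.4, 1.0], [2.0, 2.5, 0.3]]
K = gram_matrix(X, X)
```

A Gram matrix of this kind could then be handed to an unchanged SVM solver, for example scikit-learn's `SVC(kernel="precomputed")`, which is one way to swap the kernel while leaving the features and classifier as they are. For this particular unentangled map the kernel reduces to the product of cos²((xᵢ − yᵢ)/2) over features. More expressive maps do not simplify this way, and the statevector grows as 2ⁿ in the number of qubits.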

Key Contributions

Demonstrates that quantum-kernel SVMs (QSVMs) consistently achieve lower equal-error rates than classical SVMs for audio deepfake detection across four corpora without increasing model size or trainable parameters.
Provides a controlled comparison that isolates the kernel's effect by keeping the features, SVM solver, preprocessing, and cross-validation folds identical between the QSVM and the classical SVM.
Reports relative false-positive-rate reductions of up to 56.9% (ADD23) and 38.8% (ASVspoof 5) at the EER operating point, corresponding to absolute reductions of 0.107 and 0.116, using mel-spectrogram features and a classically simulated quantum kernel.
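The reduction figures can be reproduced from the reported EERs alone, since the false-positive rate equals the EER at the EER operating point. The short check below uses only the EER pairs quoted in the abstract. The helper name `fpr_reduction` is introduced here for illustration.

```python
# (QSVM EER, classical SVM EER) per corpus, as quoted in the abstract.
eers = {
    "ASVspoof 5 (2024)": (0.183, 0.299),
    "ADD23": (0.081, 0.188),
    "ASVspoof 2019 LA": (0.346, 0.399),
    "In-the-Wild": (0.355, 0.413),
}

def fpr_reduction(qsvm_eer, svm_eer):
    """Absolute and relative FPR reduction at the EER point (FPR == EER)."""
    absolute = svm_eer - qsvm_eer
    relative = absolute / svm_eer
    return absolute, relative

reductions = {name: fpr_reduction(q, c) for name, (q, c) in eers.items()}

for name, (absolute, relative) in reductions.items():
    print(f"{name}: absolute {absolute:.3f}, relative {relative * 100:.1f}%")
```

The percentages are relative to the classical SVM's false-positive rate, while the decimal values are absolute differences in rate.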

🛡️ Threat Analysis

Output Integrity Attack

The paper's primary contribution is a detection method for AI-generated audio content (synthetic speech / audio deepfakes), explicitly targeting output integrity and content authenticity — a core ML09 concern. QSVMs are proposed as a novel detection architecture rather than a routine application of existing detectors.

Details

Domains

audio

Model Types

traditional_ml

Threat Tags

inference_time

Datasets

ASVspoof 2019 LA, ASVspoof 5 (2024), ADD23, In-the-Wild

Applications

2025 0 cit.
