defense 2025

WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection

Xi Xuan 1, Xuechen Liu 2,3, Wenxin Zhang 1,4, Yi-Cheng Lin 5, Xiaojian Lin 6, Tomi Kinnunen 1

1 citation · 41 references · arXiv

Published on arXiv

2510.05305

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

WaveSP-Net outperforms several SOTA models on Deepfake-Eval-2024 and SpoofCeleb with a notably low trainable parameter count by injecting multi-resolution wavelet features into the prompt embeddings of a frozen XLSR model.

WaveSP-Net

Novel technique introduced


Modern front-end design for speech deepfake detection relies on full fine-tuning of large pre-trained models like XLSR. However, this approach is not parameter-efficient and may lead to suboptimal generalization to realistic, in-the-wild data types. To address these limitations, we introduce a new family of parameter-efficient front-ends that fuse prompt-tuning with classical signal processing transforms. These include FourierPT-XLSR, which uses the Fourier Transform, and two variants based on the Wavelet Transform: WSPT-XLSR and Partial-WSPT-XLSR. We further propose WaveSP-Net, a novel architecture combining a Partial-WSPT-XLSR front-end and a bidirectional Mamba-based back-end. This design injects multi-resolution features into the prompt embeddings, which enhances the localization of subtle synthetic artifacts without altering the frozen XLSR parameters. Experimental results demonstrate that WaveSP-Net outperforms several state-of-the-art models on two new and challenging benchmarks, Deepfake-Eval-2024 and SpoofCeleb, achieving notable performance gains with few trainable parameters. The code and models are available at https://github.com/xxuan-acoustics/WaveSP-Net.


Key Contributions

  • Three novel parameter-efficient XLSR front-end variants (FourierPT-XLSR, WSPT-XLSR, Partial-WSPT-XLSR) that fuse prompt-tuning with classical DSP transforms
  • WaveSP-Net: a full architecture combining Partial-WSPT-XLSR front-end with a bidirectional Mamba-based back-end for speech deepfake detection
  • State-of-the-art results on Deepfake-Eval-2024 and SpoofCeleb benchmarks with substantially fewer trainable parameters than full fine-tuning baselines
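
To make the wavelet-domain prompt idea above more concrete, here is a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). It assumes a single-level Haar transform along the prompt-token axis, a hypothetical per-channel `detail_scale` standing in for a learnable weight, and that "partial" means only the first `n_wavelet` prompt tokens pass through the wavelet domain. All function names and shapes are illustrative.

```python
import numpy as np

def haar_dwt(x):
    """Single-level Haar DWT along axis 0 (token axis); needs an even row count."""
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-frequency (coarse) coefficients
    detail = (even - odd) / np.sqrt(2.0)   # high-frequency (fine) coefficients
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuild and interleave the even/odd rows."""
    even = (approx + detail) / np.sqrt(2.0)
    odd = (approx - detail) / np.sqrt(2.0)
    out = np.empty((2 * approx.shape[0], approx.shape[1]), dtype=approx.dtype)
    out[0::2], out[1::2] = even, odd
    return out

def wavelet_prompts(prompts, n_wavelet, detail_scale):
    """Route the first n_wavelet prompt tokens through the wavelet domain.

    detail_scale is a hypothetical stand-in for a learnable per-channel weight
    applied to the detail coefficients; the remaining tokens are left unchanged.
    """
    head, tail = prompts[:n_wavelet], prompts[n_wavelet:]
    approx, detail = haar_dwt(head)
    head = haar_idwt(approx, detail * detail_scale)
    return np.concatenate([head, tail], axis=0)

def prepend_prompts(frozen_features, prompts):
    """Prepend prompt tokens to features from a frozen encoder layer."""
    return np.concatenate([prompts, frozen_features], axis=0)

rng = np.random.default_rng(0)
prompts = rng.standard_normal((8, 16))     # 8 prompt tokens, 16 channels (toy sizes)
features = rng.standard_normal((50, 16))   # stand-in for frozen XLSR frame features
out = prepend_prompts(features, wavelet_prompts(prompts, 4, np.ones(16)))
print(out.shape)  # (58, 16)
```

With `detail_scale` set to all ones the transform round-trips exactly, so the prompts are unchanged; setting it to zeros keeps only the coarse band, which replaces each pair of wavelet-routed tokens with their average. In a trainable version, only the prompts and the wavelet-domain weights would receive gradients, while the encoder stays frozen.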

🛡️ Threat Analysis

Output Integrity Attack

Proposes a new detection architecture (WaveSP-Net) to identify AI-generated or manipulated speech — a direct contribution to output integrity / AI-generated content detection in the audio domain.


Details

Domains
audio
Model Types
transformer, rnn
Threat Tags
inference_time
Datasets
Deepfake-Eval-2024, SpoofCeleb
Applications
speech deepfake detection, speaker verification security