defense 2025

Fake-Mamba: Real-Time Speech Deepfake Detection Using Bidirectional Mamba as Self-Attention's Alternative

Xi Xuan 1,2, Zimo Zhu 3, Wenxin Zhang 3,4, Yi-Cheng Lin 5, Tomi Kinnunen 1

0 citations

Published on arXiv

2508.09294

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Fake-Mamba achieves 0.97%, 1.74%, and 5.85% EER on the ASVspoof 2021 LA, ASVspoof 2021 DF, and In-the-Wild benchmarks, respectively, with substantial gains over the SOTA XLSR-Conformer and XLSR-Mamba models while maintaining real-time inference.

Fake-Mamba

Novel technique introduced


Advances in speech synthesis intensify security threats, motivating real-time deepfake detection research. We investigate whether bidirectional Mamba can serve as a competitive alternative to Self-Attention in detecting synthetic speech. Our solution, Fake-Mamba, integrates an XLSR front-end with bidirectional Mamba to capture both local and global artifacts. Our core innovation introduces three efficient encoders: TransBiMamba, ConBiMamba, and PN-BiMamba. Leveraging XLSR's rich linguistic representations, PN-BiMamba can effectively capture the subtle cues of synthetic speech. Evaluated on ASVspoof 21 LA, 21 DF, and In-The-Wild benchmarks, Fake-Mamba achieves 0.97%, 1.74%, and 5.85% EER, respectively, representing substantial relative gains over SOTA models XLSR-Conformer and XLSR-Mamba. The framework maintains real-time inference across utterance lengths, demonstrating strong generalization and practical viability. The code is available at https://github.com/xuanxixi/Fake-Mamba.
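
To make "bidirectional" concrete, here is a minimal sketch of a two-direction state-space scan. It is an illustration only, not the authors' implementation: it uses a fixed scalar linear recurrence in place of Mamba's input-dependent (selective) parameters, and it assumes the forward and backward outputs are merged by summation. The names `ssm_scan` and `bidirectional_scan` and the parameters `a`, `b`, `c` are hypothetical.

```python
def ssm_scan(x, a=0.9, b=1.0, c=1.0):
    """Causal linear recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t.

    Simplified stand-in for a state-space layer; real Mamba makes the
    parameters depend on the input and uses a vector-valued state.
    """
    h, y = 0.0, []
    for x_t in x:
        h = a * h + b * x_t
        y.append(c * h)
    return y


def bidirectional_scan(x, **params):
    """Scan left-to-right and right-to-left, then merge.

    Summation is an assumed merge rule chosen for illustration.
    """
    forward = ssm_scan(x, **params)
    # Run the same recurrence on the reversed sequence, then flip back
    # so that position t holds a summary of frames t..end.
    backward = ssm_scan(x[::-1], **params)[::-1]
    return [f + g for f, g in zip(forward, backward)]


# An impulse in the middle influences both earlier and later positions.
print(bidirectional_scan([0.0, 1.0, 0.0], a=0.5))  # [0.5, 2.0, 0.5]
```

A single causal scan would give `[0.0, 1.0, 0.5]` for the same input, so the first frame would carry no information about the impulse that follows it. The backward pass is what lets every frame see context on both sides, at a cost that stays linear in sequence length.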


Key Contributions

  • First framework replacing multi-head self-attention with bidirectional state-space modeling (BiMamba) for speech deepfake detection
  • Three novel encoder variants (TransBiMamba, ConBiMamba, PN-BiMamba) offering efficient spectral-temporal feature learning with near-linear complexity
  • EERs of 0.97%, 1.74%, and 5.85% on the ASVspoof 2021 LA, ASVspoof 2021 DF, and In-the-Wild benchmarks, respectively, surpassing XLSR-Conformer and XLSR-Mamba, with real-time inference across variable utterance lengths
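
The results above are reported as equal error rate (EER): the operating point at which the rate of spoofed utterances accepted equals the rate of bona fide utterances rejected. The sketch below shows one simple way to estimate it from detector scores. It assumes that higher scores mean "more likely bona fide" and picks the observed threshold where the two rates are closest; official ASVspoof scoring tools may interpolate differently. The function name `compute_eer` and the toy scores are hypothetical.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for a detector where higher = more bona fide."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection rate: bona fide utterances scored below t.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed utterances scored at or above t.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the midpoint of the two rates at the closest crossing.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores: one bona fide and one spoofed utterance overlap.
eer, threshold = compute_eer([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6])
print(eer, threshold)  # 0.25 0.6
```

On this scale, the reported 0.97% EER on ASVspoof 2021 LA means that at the balanced threshold roughly 1 in 100 utterances of each class is misclassified.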

🛡️ Threat Analysis

Output Integrity Attack

Proposes a novel AI-generated content detection architecture (Fake-Mamba) for synthetic speech. This is audio deepfake detection, a direct instance of output integrity and content authenticity verification. The contribution is a new forensic detection architecture (TransBiMamba, ConBiMamba, PN-BiMamba), not an application of existing methods to a new domain.


Details

Domains
audio
Model Types
transformer
Threat Tags
inference_time
Datasets
ASVspoof 2021 LA, ASVspoof 2021 DF, In-the-Wild
Applications
speech deepfake detection, synthetic speech detection, voice anti-spoofing