defense 2025

XLSR-Kanformer: A KAN-Integrated Model for Synthetic Speech Detection

Phuong Tuan Dat 1,2, Tran Huy Dat 2

1 citation · 19 references · AVSS

Published on arXiv

2510.06706

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Integrating KAN into XLSR-Conformer achieves a 60.55% relative EER improvement over the baseline, with 0.70% EER on the ASVspoof2021 LA set and SOTA on the DF set.

XLSR-Kanformer

Novel technique introduced


Recent advancements in speech synthesis technologies have led to increasingly sophisticated spoofing attacks, posing significant challenges for automatic speaker verification systems. While systems based on self-supervised learning (SSL) models, particularly the XLSR-Conformer architecture, have demonstrated remarkable performance in synthetic speech detection, there remains room for architectural improvements. In this paper, we propose a novel approach that replaces the traditional Multi-Layer Perceptron (MLP) in the XLSR-Conformer model with a Kolmogorov-Arnold Network (KAN), a powerful universal approximator based on the Kolmogorov-Arnold representation theorem. Our experimental results on ASVspoof2021 demonstrate that integrating KAN into the XLSR-Conformer model yields a 60.55% relative improvement in Equal Error Rate (EER) on the LA and DF sets and achieves 0.70% EER on the 21LA set. The proposed replacement is also robust across various SSL architectures. These findings suggest that incorporating KAN into SSL-based models is a promising direction for advances in synthetic speech detection.
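To make the MLP-to-KAN replacement concrete, here is a minimal NumPy sketch of a single KAN-style layer. It is not the authors' implementation. Where an MLP layer applies a fixed nonlinearity after a linear map, a KAN layer places a learnable one-dimensional function on every input-output edge and sums the results. The class name `KANLayerSketch`, the Gaussian basis (a simplification standing in for the B-splines commonly used in KANs), the grid range, and all tensor sizes are assumptions chosen for illustration.

```python
import numpy as np


class KANLayerSketch:
    """Illustrative KAN-style layer: y_j = sum_i phi_ji(x_i).

    Assumption: each edge function phi_ji is a weighted sum of Gaussian
    bumps on a fixed grid. This is a simplification, not the spline
    parameterisation of the original KAN formulation.
    """

    def __init__(self, in_dim, out_dim, num_basis=8, x_min=-2.0, x_max=2.0, seed=0):
        rng = np.random.default_rng(seed)
        self.centers = np.linspace(x_min, x_max, num_basis)  # shape (K,)
        self.width = (x_max - x_min) / (num_basis - 1)
        # One coefficient vector per (output, input) edge; these would be learned.
        self.coef = rng.normal(0.0, 0.1, size=(out_dim, in_dim, num_basis))

    def basis(self, x):
        # x: (..., in_dim) -> basis activations of shape (..., in_dim, K)
        z = (x[..., None] - self.centers) / self.width
        return np.exp(-(z ** 2))

    def __call__(self, x):
        # Sum phi_ji(x_i) over inputs i (and basis index k) for each output j.
        return np.einsum("...ik,oik->...o", self.basis(x), self.coef)


# Hypothetical SSL feature tensor: (batch, frames, feature_dim).
frames = np.random.default_rng(1).normal(size=(2, 50, 16))
layer = KANLayerSketch(in_dim=16, out_dim=32)
out = layer(frames)  # shape (2, 50, 32)
```

In a Conformer block, a layer of this kind would sit where the feed-forward MLP normally does. The trainable parameters are the per-edge coefficients rather than a single weight matrix followed by a fixed activation.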


Key Contributions

  • Proposes XLSR-Kanformer: replaces MLP layers in the XLSR-Conformer with Kolmogorov-Arnold Networks (KANs) for improved feature learning from SSL representations
  • Achieves a 60.55% relative EER improvement on the ASVspoof2021 LA and DF sets, reaching 0.70% EER on 21LA and a new SOTA on the DF set
  • Demonstrates that KAN replacement is robust across multiple SSL backbone architectures
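The EER figures in the contributions above can be read with the help of a short sketch. The code below shows one common way to estimate the Equal Error Rate from detector scores and how a relative improvement is calculated. It is not the official ASVspoof scoring tool. The function names, the "higher score means more bona fide" convention, and the example numbers are all assumptions made for illustration, and none of the values are results from the paper.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Estimate EER, assuming a higher score means 'more likely bona fide'."""
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([bonafide, spoof]))
    # False rejection rate: bona fide trials scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed trials scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # EER is taken where the two error rates are closest to each other.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Toy scores (hypothetical): one bona fide trial overlaps the spoofed range.
eer = compute_eer([0.6, 0.4], [0.5, 0.3])
# Hypothetical EER values in percent, only to show the arithmetic.
gain = relative_improvement(2.0, 1.0)
```

A relative improvement is measured against the baseline's own error rate, so the same percentage can correspond to very different absolute EER changes on different evaluation sets.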

🛡️ Threat Analysis

Output Integrity Attack

Proposes a novel detection architecture (KAN-augmented conformer) for identifying AI-generated/synthetic speech — directly addresses output integrity by authenticating whether audio is genuine or machine-generated. This is a novel architectural contribution to deepfake audio detection, not a simple domain application of existing methods.


Details

Domains
audio
Model Types
transformer
Threat Tags
inference_time
Datasets
ASVspoof2021
Applications
synthetic speech detection
automatic speaker verification
anti-spoofing