defense 2026

Assessing the Impact of Speaker Identity in Speech Spoofing Detection

Anh-Tuan Dao 1, Driss Matrouf 1, Nicholas Evans 2

0 citations · 19 references · arXiv (Cornell University)

Published on arXiv

2602.20805

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

The speaker-invariant MHFA-IVspk model reduces the average equal error rate (EER) by 17.2% across four datasets compared to the baseline, with up to a 48% EER reduction on the most challenging attack (A11).
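For readers unfamiliar with the metric, the following is a minimal sketch of how an equal error rate and a percentage reduction between two EERs could be computed. It is not the paper's evaluation code. The function names, the score convention (higher means more likely bona fide), the threshold sweep, and all numbers in the usage lines are illustrative assumptions; whether the reported percentages are relative reductions of this form is also assumed here.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER by sweeping every observed score as a threshold.

    Assumption: a higher score means "more likely bona fide".
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed trials scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, new_eer):
    """Percentage reduction of new_eer relative to baseline_eer."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Hypothetical toy scores, not data from the paper.
toy_bonafide = [0.9, 0.8, 0.7, 0.4]
toy_spoof = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(toy_bonafide, toy_spoof)  # 0.25

# Hypothetical EERs (in %), chosen only to show the arithmetic.
toy_gain = relative_reduction(5.0, 4.0)  # 20.0
```

Production toolkits typically interpolate between thresholds rather than picking the closest observed one, so values from this sketch can differ slightly from officially scored results.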

SInMT (Speaker-Invariant Multi-Task)

Novel technique introduced


Spoofing detection systems are typically trained using diverse recordings from multiple speakers, often assuming that the resulting embeddings are independent of speaker identity. However, this assumption remains unverified. In this paper, we investigate the impact of speaker information on spoofing detection systems. We propose two approaches within our Speaker-Invariant Multi-Task (SInMT) framework: one that models speaker identity within the embeddings and another that removes it. SInMT integrates multi-task learning for joint speaker recognition and spoofing detection, incorporating a gradient reversal layer. Evaluated using four datasets, our speaker-invariant model reduces the average equal error rate by 17% compared to the baseline, with up to 48% reduction for the most challenging attacks (e.g., A11).
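The gradient reversal idea mentioned in the abstract can be sketched in a few lines. This is a generic illustration of a gradient reversal layer, not the authors' implementation: the class and function names, the scaling factor `lam`, and the use of plain Python lists in place of tensors are all assumptions made for the example. The layer passes features through unchanged in the forward pass and flips the sign of the gradient coming back from the speaker-recognition head, so the shared embedding is pushed away from encoding speaker identity while the spoofing head is trained normally.

```python
class GradientReversal:
    """Identity in the forward pass, negated (scaled) gradient in the backward pass.

    `lam` is a hypothetical scaling factor; the summary does not state one.
    """

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        # Features reach the speaker head unchanged.
        return list(features)

    def backward(self, grad_output):
        # The speaker-loss gradient is reversed before reaching the encoder.
        return [-self.lam * g for g in grad_output]


def shared_embedding_gradient(grad_spoof, grad_speaker, grl=None):
    """Combine the two task gradients that flow into the shared embedding.

    With `grl=None` the speaker gradient is added as-is (a speaker-aware
    setup); with a GradientReversal instance it is reversed first (a
    speaker-invariant setup).
    """
    if grl is not None:
        grad_speaker = grl.backward(grad_speaker)
    return [a + b for a, b in zip(grad_spoof, grad_speaker)]


# Hypothetical gradient values, for illustration only.
grl = GradientReversal(lam=0.5)
aware = shared_embedding_gradient([1.0, 1.0], [2.0, -4.0])
invariant = shared_embedding_gradient([1.0, 1.0], [2.0, -4.0], grl)
```

In a deep learning framework the same behavior would normally be written as a custom autograd function; the list-based version above only shows where the sign flip enters the update.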


Key Contributions

  • SInMT framework that unifies speaker-aware and speaker-invariant spoofing detection via a dual-head MHFA classifier with an optional gradient reversal layer (GRL)
  • Empirical demonstration that removing speaker identity information (via the GRL) reduces the average EER by 17.2% across four datasets and by up to 48% for the hardest attack type (A11)
  • Analysis of how explicit speaker-identity modeling (speaker-aware) and speaker-identity suppression (speaker-invariant) each improve over the baseline MHFA spoofing detector

🛡️ Threat Analysis

Output Integrity Attack

The paper proposes and evaluates a detection system for spoofed or synthetic speech (audio deepfakes). This is a form of AI-generated content detection, which is explicitly listed under ML09 (output integrity). The SInMT framework improves the detector's ability to identify text-to-speech (TTS) and voice-conversion attacks.


Details

Domains
audio
Model Types
transformer
Threat Tags
inference_time
Datasets
ASVspoof 2019, ASVspoof 5
Applications
audio deepfake detection, automatic speaker verification, speech spoofing countermeasures