Privacy-Preserving End-to-End Full-Duplex Speech Dialogue Models
Nikita Kuzmin 1,2, Tao Zhong 3,4, Jiajun Deng 3, Yingke Zhu 3, Tristan Tsoi 3, Tianxiang Cao 3, Simon Lui 3, Kong Aik Lee 5, Eng Siong Chng 1
1 Nanyang Technological University
2 A*STAR
3 Huawei Leibniz Research Center
Published on arXiv: 2603.08179
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
Anon-W2F raises speaker verification EER by over 3.5× (11.2% → 41.0%), approaching the 50% random-chance ceiling, with first-response latency under 0.8 s
Anon-W2F / Anon-W2W (Stream-Voice-Anon)
Novel technique introduced
End-to-end full-duplex speech models feed user audio through an always-on LLM backbone, yet the speaker privacy implications of their hidden representations remain unexamined. Following the VoicePrivacy 2024 protocol with a lazy-informed attacker, we show that the hidden states of SALM-Duplex and Moshi leak substantial speaker identity across all transformer layers. Layer-wise and turn-wise analyses reveal that SALM-Duplex leaks more strongly in its early layers while Moshi leaks uniformly, and that Linkability rises sharply within the first few turns. We propose two streaming anonymization setups using Stream-Voice-Anon: a waveform-level front-end (Anon-W2W) and a feature-domain replacement (Anon-W2F). Anon-W2F raises EER by over 3.5× relative to the discrete encoder baseline (11.2% to 41.0%), approaching the 50% random-chance ceiling, while Anon-W2W retains 78–93% of baseline sBERT across setups with sub-second response latency (FRL under 0.8 s).
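To make the headline metric concrete, the following is a minimal sketch of how an equal error rate (EER) can be computed from speaker-verification scores, and of the arithmetic behind the "over 3.5×" figure. It is not the evaluation code of the paper or of the VoicePrivacy 2024 toolkit; the function name `equal_error_rate` and the simple threshold sweep are illustrative assumptions. A higher EER means the attacker's verifier separates same-speaker from different-speaker trials less well, and 50% corresponds to chance.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Return the EER as a fraction in [0, 1].

    Hypothetical helper for illustration only: it sweeps every observed
    score as a decision threshold and returns the operating point where
    false acceptance and false rejection rates are closest.
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # False rejection: same-speaker trials scored below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance: different-speaker trials scored at or above it.
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2)


# Relative EER increase reported in the abstract (11.2% -> 41.0%).
relative_increase = 41.0 / 11.2  # roughly 3.66, i.e. "over 3.5x"
```

With well-separated score distributions the function returns a value near 0, and with indistinguishable distributions it returns a value near 0.5, which is the ceiling that the anonymized setup approaches.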
Key Contributions
- First empirical characterization of speaker identity leakage in E2E full-duplex LLM speech models (SALM-Duplex, Moshi), showing that leakage persists across all transformer layers, with attacker EER as low as 6.4% for Moshi (a low EER means speakers are easy to re-identify)
- Layer-wise and turn-wise leakage analysis revealing Linkability rises sharply within the first few dialogue turns, posing GDPR compliance risks
- Two streaming anonymization setups built on Stream-Voice-Anon: Anon-W2F (feature-domain) raises EER from 11.2% to 41.0%, and Anon-W2W (waveform-level) retains 78–93% of baseline sBERT, with sub-second first-response latency (under 0.8 s)