defense 2026

Multi-Channel Replay Speech Detection using Acoustic Maps

Michael Neri, Tuomas Virtanen

0 citations · 19 references · arXiv (Cornell University)

Published on arXiv

2602.16399

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

A lightweight ~6k-parameter CNN operating on beamforming-derived acoustic maps achieves competitive replay detection on ReMASC while remaining physically interpretable and array-agnostic.

Acoustic Maps

Novel technique introduced


Replay attacks remain a critical vulnerability for automatic speaker verification systems, particularly in real-time voice assistant applications. In this work, we propose acoustic maps as a novel spatial feature representation for replay speech detection from multi-channel recordings. Derived from classical beamforming over discrete azimuth and elevation grids, acoustic maps encode directional energy distributions that reflect physical differences between human speech radiation and loudspeaker-based replay. A lightweight convolutional neural network is designed to operate on this representation, achieving competitive performance on the ReMASC dataset with approximately 6k trainable parameters. Experimental results show that acoustic maps provide a compact and physically interpretable feature space for replay attack detection across different devices and acoustic environments.
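To make the idea of an acoustic map concrete, the following is a minimal sketch of a narrowband delay-and-sum beamformer scanned over an azimuth and elevation grid. It is an illustration under stated assumptions, not the authors' implementation: the array geometry, the single analysis frequency, the grid resolution, and the function names `steering_vector` and `acoustic_map` are all hypothetical, and the paper's exact beamformers and preprocessing are not described in this summary.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def unit_vector(az, el):
    """Unit vector pointing toward (azimuth, elevation), both in radians."""
    return (math.cos(el) * math.cos(az),
            math.cos(el) * math.sin(az),
            math.sin(el))


def steering_vector(mics, az, el, freq):
    """Far-field phase factors for a plane wave arriving from (az, el)."""
    u = unit_vector(az, el)
    k = 2 * math.pi * freq / SPEED_OF_SOUND  # wavenumber
    # A mic at position p hears the wave earlier than the origin by (p . u) / c,
    # which is a phase advance of k * (p . u) at this frequency.
    return [cmath.exp(1j * k * sum(pi * ui for pi, ui in zip(p, u)))
            for p in mics]


def acoustic_map(mics, snapshot, freq, azimuths, elevations):
    """Delay-and-sum output power for every (elevation, azimuth) grid cell."""
    n = len(mics)
    grid = []
    for el in elevations:
        row = []
        for az in azimuths:
            a = steering_vector(mics, az, el, freq)
            # Undo the expected phase at each mic, then average across mics.
            y = sum(w.conjugate() * x for w, x in zip(a, snapshot)) / n
            row.append(abs(y) ** 2)
        grid.append(row)
    return grid


# Hypothetical 4-mic circular array, radius 5 cm, lying in the horizontal plane.
radius = 0.05
mics = [(radius * math.cos(a), radius * math.sin(a), 0.0)
        for a in (0.0, math.pi / 2, math.pi, 3 * math.pi / 2)]
azimuths = [math.radians(d) for d in range(0, 360, 30)]
elevations = [math.radians(d) for d in (0, 30, 60)]
freq = 1000.0  # Hz, single analysis frequency chosen for illustration

# Simulated single far-field source at azimuth 60 degrees, elevation 0 degrees.
snapshot = steering_vector(mics, math.radians(60), 0.0, freq)
amap = acoustic_map(mics, snapshot, freq, azimuths, elevations)
```

Here `amap` is a 3-by-12 grid of directional energies whose largest cell lies at the simulated source direction. In practice such a map would be computed from short-time Fourier frames of real multi-channel audio and aggregated over frequency, and the resulting 2D array is the kind of input a small CNN can consume.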


Key Contributions

  • Acoustic maps — beamforming-derived spatial feature representations encoding directional energy distributions that distinguish human vocal radiation from loudspeaker-based replay
  • Compact ~6k-parameter CNN tailored to operate on acoustic map representations for replay detection
  • Evaluation under environment-dependent and environment-independent conditions across multiple microphone arrays and beamformer types on the ReMASC dataset
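To give a sense of how small a budget of roughly 6k trainable parameters is, the sketch below counts the parameters of a hypothetical three-layer convolutional stack followed by global pooling and a single output unit. The layer widths, kernel sizes, and the single-channel input are assumptions chosen only to land near that budget; they are not the architecture reported in the paper.

```python
def conv2d_params(c_in, c_out, kernel):
    """Weights plus biases of a square-kernel 2D convolution."""
    return (kernel * kernel * c_in + 1) * c_out


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return (n_in + 1) * n_out


# Hypothetical layout: three 3x3 convolutions, global average pooling
# (no parameters), then one logit for genuine versus replay.
layers = [
    conv2d_params(1, 8, 3),
    conv2d_params(8, 16, 3),
    conv2d_params(16, 32, 3),
    linear_params(32, 1),
]
total = sum(layers)
print(total)  # 5921
```

Because global pooling removes the dependence on the spatial size of the input, a count like this does not change with the resolution of the azimuth and elevation grid.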

🛡️ Threat Analysis

Input Manipulation Attack

Replay attacks are inference-time input manipulation attacks that cause automatic speaker verification (ASV) models to misclassify replayed speech as genuine; the paper proposes acoustic maps as a detection countermeasure against this physical-access adversarial input threat.


Details

Domains
audio
Model Types
cnn
Threat Tags
inference_time, physical, black_box
Datasets
ReMASC
Applications
speaker verification, voice assistant authentication, anti-spoofing countermeasures