Defense · 2025

EchoMark: Perceptual Acoustic Environment Transfer with Watermark-Embedded Room Impulse Response

Chenpei Huang 1, Lingfeng Yao 1, Kyu In Lee 1, Lan Emily Zhang 2, Xun Chen, Miao Pan 1

0 citations · 30 references · arXiv


Published on arXiv (2511.06458)

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

EchoMark achieves >99% watermark detection accuracy and a BER below 0.3% while maintaining a MOS of 4.22/5 for perceptual acoustic environment transfer, on par with FiNS.

EchoMark

Novel technique introduced


Acoustic Environment Matching (AEM) is the task of transferring clean audio into a target acoustic environment, enabling engaging applications such as audio dubbing and immersive virtual reality (VR) audio. Recovering a similar room impulse response (RIR) directly from reverberant speech offers a more accessible and flexible AEM solution. However, this capability also introduces vulnerabilities to arbitrary "relocation" if misused by malicious users, such as facilitating advanced voice spoofing attacks or undermining the authenticity of recorded evidence. To address this issue, we propose EchoMark, the first deep learning-based AEM framework that generates perceptually similar RIRs with an embedded watermark. Our design tackles the challenges posed by variable RIR characteristics, such as different durations and energy decays, by operating in the latent domain. By jointly optimizing the model with a perceptual loss for RIR reconstruction and a loss for watermark detection, EchoMark achieves both high-quality environment transfer and reliable watermark recovery. Experiments on diverse datasets validate that EchoMark matches room acoustic parameters on par with FiNS, the state-of-the-art RIR estimator. Furthermore, a high Mean Opinion Score (MOS) of 4.22 out of 5, watermark detection accuracy exceeding 99%, and a bit error rate (BER) below 0.3% collectively demonstrate that EchoMark preserves perceptual quality while ensuring reliable watermark embedding.
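At its core, applying an estimated RIR to dry speech is a convolution: the reverberant signal is the clean signal convolved with the impulse response of the target room. The sketch below illustrates this basic AEM step with a synthetic exponentially decaying RIR; it is a minimal illustration of the convolution itself, not the EchoMark model, and the function name `apply_rir` and the toy signals are assumptions for this example.

```python
import numpy as np
from scipy.signal import fftconvolve

def apply_rir(clean: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve dry speech with a room impulse response, then rescale
    the result back to the dry signal's peak level (illustrative only)."""
    wet = fftconvolve(clean, rir, mode="full")[: len(clean)]
    peak = np.max(np.abs(wet))
    if peak > 0:
        wet = wet * (np.max(np.abs(clean)) / peak)
    return wet

# Toy example: a 1 s dry tone and a synthetic exponentially decaying RIR.
sr = 16000
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 440 * t)
decay = np.exp(-6 * np.arange(sr // 4) / (sr // 4))
rir = decay * np.random.default_rng(0).standard_normal(sr // 4)
reverberant = apply_rir(clean, rir)
```

A watermark embedded in the generated RIR must survive exactly this convolution with downstream speech, which is what the paper's robustness results measure.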


Key Contributions

  • First deep learning AEM framework jointly optimizing perceptual RIR reconstruction and watermark embedding in the latent domain to handle variable RIR durations and energy decays
  • Watermark detection accuracy exceeding 99% with BER below 0.3% even after convolution with speech signals, demonstrating robustness to downstream audio processing
  • Achieves MOS of 4.22/5 for perceptual environment transfer quality, with room acoustic parameter matching comparable to state-of-the-art FiNS RIR estimator
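The reported bit error rate is the fraction of payload bits that flip between embedding and recovery. A minimal sketch of how such a metric is computed, assuming a hypothetical decoder whose errors we simulate by randomly flipping about 0.2% of bits (below the paper's reported 0.3% BER):

```python
import numpy as np

def bit_error_rate(sent: np.ndarray, recovered: np.ndarray) -> float:
    """Fraction of watermark payload bits that flipped end to end."""
    assert sent.shape == recovered.shape
    return float(np.mean(sent != recovered))

rng = np.random.default_rng(42)
payload = rng.integers(0, 2, size=1000)          # 1000-bit watermark payload
recovered = payload.copy()
flips = rng.random(payload.size) < 0.002         # simulated decoder errors
recovered[flips] ^= 1
ber = bit_error_rate(payload, recovered)
```

In the paper's setting, `recovered` would come from the watermark detector run on reverberant speech produced with the watermarked RIR; here it is synthetic.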

🛡️ Threat Analysis

Output Integrity Attack

EchoMark watermarks model-generated RIR audio content to authenticate acoustic environment provenance and detect unauthorized 'relocation' — this is content watermarking of AI-generated audio outputs to address output integrity, analogous to LLM text or image watermarking for deepfake/spoofing prevention.


Details

Domains
audio, generative
Model Types
transformer
Threat Tags
inference_time
Applications
audio dubbing, virtual reality audio, audio forensics, voice spoofing prevention