AI-Generated Music Detection in Broadcast Monitoring
David López-Ayala¹, Asier Cabello², Pablo Zinemanas², Emilio Molina², Martín Rocamora¹
Published on arXiv
2602.06823
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Models achieving high F1 in streaming scenarios drop below 60% F1 in broadcast conditions, where music is masked by dominant speech or appears only as short excerpts.
AI-OpenBMAT
Novel technique introduced
AI music generators have advanced to the point where their outputs are often indistinguishable from human compositions. While detection methods have emerged, they are typically designed and validated in music streaming contexts with clean, full-length tracks. Broadcast audio, however, poses a different challenge: music appears as short excerpts, often masked by dominant speech, conditions under which existing detectors fail. In this work, we introduce AI-OpenBMAT, the first dataset tailored to broadcast-style AI-music detection. It contains 3,294 one-minute audio excerpts (54.9 hours) that follow the duration patterns and loudness relations of real television audio, combining human-made production music with stylistically matched continuations generated with Suno v3.5. We benchmark a CNN baseline and state-of-the-art SpectTTTra models to assess SNR and duration robustness, and evaluate on a full broadcast scenario. Across all settings, models that excel in streaming scenarios suffer substantial degradation, with F1-scores dropping below 60% when music is in the background or has a short duration. These results highlight speech masking and short music length as critical open challenges for AI music detection, and position AI-OpenBMAT as a benchmark for developing detectors capable of meeting industrial broadcast requirements.
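The broadcast-style masking described above can be simulated by mixing music under speech at a controlled power ratio. A minimal NumPy sketch, using random stand-in signals and defining SNR as music power relative to speech power; the function name and signals are illustrative, not the authors' actual dataset pipeline (which follows loudness relations measured from real television audio):

```python
import numpy as np

def mix_at_snr(music: np.ndarray, speech: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `music` so its power sits snr_db decibels relative to the
    speech power (negative = background music), then sum the signals."""
    p_music = np.mean(music ** 2)
    p_speech = np.mean(speech ** 2)
    # Gain that brings the music to the target power ratio vs. speech.
    gain = np.sqrt(p_speech * 10 ** (snr_db / 10) / p_music)
    return gain * music + speech

rng = np.random.default_rng(0)
music = rng.standard_normal(16000)   # 1 s of stand-in "music" at 16 kHz
speech = rng.standard_normal(16000)  # 1 s of stand-in "speech"

# Background music 12 dB below speech, a typical broadcast-style condition.
mix = mix_at_snr(music, speech, snr_db=-12.0)
```

Sweeping `snr_db` from positive (foreground music) to strongly negative (music buried under speech) reproduces the kind of robustness test the benchmark applies.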
Key Contributions
- AI-OpenBMAT: the first dataset (3,294 one-minute excerpts, 54.9 hours) tailored to AI-generated music detection under broadcast conditions, pairing human-composed tracks with Suno v3.5 stylistic continuations at realistic broadcast SNR and duration distributions
- Systematic benchmarking of CNN baseline and SpectTTTra models across SNR sweeps, duration sensitivity tests, and full broadcast scenario evaluation
- Demonstration that state-of-the-art streaming-centric detectors degrade substantially in broadcast settings, with F1-scores falling below 60% under speech masking or short music duration
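For reference, the F1-score cited in these results is the harmonic mean of precision and recall over binary AI/human labels. A minimal sketch with hypothetical labels (not data from the paper):

```python
import numpy as np

def f1_score(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall (positive class = 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # correctly flagged AI clips
    fp = np.sum((y_true == 0) & (y_pred == 1))  # human clips flagged as AI
    fn = np.sum((y_true == 1) & (y_pred == 0))  # AI clips missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical detector output: one miss and one false alarm on five clips.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0]
print(round(f1_score(y_true, y_pred), 3))  # prints 0.667
```

An F1 below 60%, as reported for speech-masked and short-duration music, means the detector's balance of missed AI music and false alarms is far from the near-perfect scores these models reach on clean streaming audio.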
🛡️ Threat Analysis
The paper's core contribution is evaluating detectors of AI-generated audio content (music), a direct instance of output integrity and content provenance verification. Detection of AI-generated content (deepfakes, synthetic audio, AI text) is explicitly within ML09 scope.