Benchmark · 2025

Melody or Machine: Detecting Synthetic Music with Dual-Stream Contrastive Learning

Arnesh Batra 1, Dev Sharma 1, Krish Thukral 2, Ruhani Bhatia 1, Naman Batra 3, Aditya Gautam 1

0 citations · 34 references · TMLR


Published on arXiv · 2512.00621

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

CLAM achieves F1=0.925 on the challenging MoM out-of-distribution benchmark, outperforming the prior SOTA (SpecTTTra, F1=0.869) and scoring 0.993 on SONICS.

CLAM (Contrastive Learning for Audio Matching)

Novel technique introduced


The rapid evolution of end-to-end AI music generation poses an escalating threat to artistic authenticity and copyright, demanding detection methods that can keep pace. Existing models such as SpecTTTra, while foundational, falter against the diverse and rapidly advancing ecosystem of new generators, exhibiting significant performance drops on out-of-distribution (OOD) content. This generalization failure exposes a critical gap: the need for more challenging benchmarks and more robust detection architectures.

To address this, we first introduce Melody or Machine (MoM), a new large-scale benchmark of over 130,000 songs (6,665 hours). MoM is the most diverse dataset to date, built from a mix of open- and closed-source models, with a curated OOD test set designed specifically to foster the development of truly generalizable detectors.

Alongside the benchmark, we introduce CLAM, a novel dual-stream detection architecture. We hypothesize that subtle, machine-induced inconsistencies between vocal and instrumental elements, often imperceptible in the mixed signal, offer a powerful tell-tale sign of synthesis. CLAM tests this hypothesis by employing two distinct pre-trained audio encoders (MERT and Wav2Vec2) to create parallel representations of the audio, which are fused by a learnable cross-aggregation module that models their inter-dependencies. The model is trained with a dual-loss objective: a standard binary cross-entropy loss for classification, complemented by a contrastive triplet loss that teaches the model to distinguish coherent from artificially mismatched stream pairings, sharpening its sensitivity to synthetic artifacts without presuming a simple feature alignment. CLAM establishes a new state of the art in synthetic music forensics, achieving an F1 score of 0.925 on the challenging MoM benchmark.
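The contrastive triplet objective described above can be sketched in plain Python. This is a minimal illustration, not the paper's exact formulation: the margin value, the toy embeddings, and the use of cosine distance are all assumptions made here for clarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on cosine distance: pull the coherent
    (anchor, positive) pairing together and push the artificially
    mismatched (anchor, negative) pairing apart by at least `margin`."""
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy 3-dim embeddings standing in for the two stream representations
# (hypothetical values; the real encoders output high-dimensional features).
vocal      = [1.0, 0.0, 0.5]   # anchor: vocal-stream embedding
instr_same = [0.9, 0.1, 0.4]   # positive: instrumental stream, same song
instr_diff = [-0.5, 1.0, 0.0]  # negative: instrumental stream, different song

loss = triplet_loss(vocal, instr_same, instr_diff)
```

Here the coherent pairing is already much closer than the mismatched one, so the hinge is inactive and the loss is zero; during training, gradients flow only while mismatched pairings remain within the margin of coherent ones.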


Key Contributions

  • MoM: a large-scale benchmark of 130,435 songs (6,665 hours) spanning 9 generative models with a dedicated out-of-distribution test set to measure real-world generalization
  • CLAM: a dual-stream detection architecture using MERT and Wav2Vec2 encoders fused via a cross-aggregation module, trained with BCE + contrastive triplet loss to exploit vocal-instrumental inconsistencies as a forensic signal
  • State-of-the-art results: F1 of 0.925 on MoM (vs. prior SOTA 0.869) and 0.993 on SONICS benchmark
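As a toy illustration of the cross-aggregation idea in the second bullet, a minimal dot-product attention fusion over two feature streams might look like the following. The frame values, dimensions, and the concatenation scheme are assumptions for illustration; the paper's module is learned end to end.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_aggregate(stream_a, stream_b):
    """Each frame of stream A attends over stream B's frames via
    dot-product attention; the attended summary is concatenated
    with A's own frame to model cross-stream dependencies."""
    fused = []
    for qa in stream_a:
        scores = [sum(x * y for x, y in zip(qa, kb)) for kb in stream_b]
        weights = softmax(scores)
        attended = [sum(w * kb[i] for w, kb in zip(weights, stream_b))
                    for i in range(len(stream_b[0]))]
        fused.append(qa + attended)  # concat own frame + attended summary
    return fused

# Two tiny "streams": 3 frames of 2-dim features each (hypothetical values
# standing in for MERT and Wav2Vec2 frame-level outputs).
mert_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w2v_feats  = [[0.2, 0.8], [0.9, 0.1], [0.4, 0.6]]
fused = cross_aggregate(mert_feats, w2v_feats)
```

Each fused frame is twice the width of the input frames, giving the downstream classifier a joint view of both streams per time step.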

🛡️ Threat Analysis

Output Integrity Attack

Directly addresses AI-generated content detection — a core ML09 concern. CLAM is a novel forensic detection architecture that identifies synthetic music artifacts, and MoM is a benchmark explicitly designed to evaluate AI-generated audio detectors. Both contributions fall squarely under output integrity and content provenance.


Details

Domains
audio
Model Types
transformer
Threat Tags
inference_time
Datasets
MoM, SONICS
Applications
synthetic music detection, AI-generated audio forensics