defense 2026

Fusion Segment Transformer: Bi-Directional Attention Guided Fusion Network for AI-Generated Music Detection

Yumin Kim, Seonghyeon Go

0 citations · 21 references · arXiv


Published on arXiv · 2601.13647

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves state-of-the-art AI-generated music detection on SONICS and AIME datasets by fusing content and structural segment embeddings through bi-directional cross-attention and adaptive gating

Fusion Segment Transformer

Novel technique introduced


With the rise of generative AI technology, anyone can now easily create and deploy AI-generated music, which has heightened the need for technical solutions to copyright and ownership issues. While existing work has mainly focused on short audio clips, full-audio detection, which requires modeling long-term structure and context, remains insufficiently explored. To address this, we propose an improved version of the Segment Transformer, termed the Fusion Segment Transformer. As in our previous work, we extract content embeddings from short music segments using diverse feature extractors. We further enhance the architecture for full-audio AI-generated music detection by introducing a Gated Fusion Layer that effectively integrates content and structural information, enabling the capture of long-term context. Experiments on the SONICS and AIME datasets show that our approach outperforms the previous model and recent baselines, achieving state-of-the-art results in AI-generated music detection.
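The abstract's first stage, extracting embeddings from short music segments of a full-length track, presupposes a segmentation step. The paper's exact segment and hop lengths are not given in this summary; the sketch below uses hypothetical 5-second non-overlapping windows purely for illustration.

```python
import numpy as np

def segment_audio(audio, sr, seg_seconds=5.0, hop_seconds=5.0):
    """Split a full-length waveform into fixed-length segments.

    seg_seconds/hop_seconds are illustrative assumptions, not the
    paper's actual segmentation parameters.
    """
    seg_len = int(seg_seconds * sr)
    hop = int(hop_seconds * sr)
    # Collect every complete window; trailing samples shorter than
    # one segment are dropped.
    segments = [audio[i:i + seg_len]
                for i in range(0, len(audio) - seg_len + 1, hop)]
    return np.stack(segments)  # shape: (num_segments, seg_len)

sr = 16000
audio = np.zeros(sr * 30)      # 30 s of silence as a stand-in waveform
segs = segment_audio(audio, sr)
print(segs.shape)              # (6, 80000)
```

Each row of the resulting array would then be passed through the feature extractors to produce one content embedding per segment.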


Key Contributions

  • Fusion Segment Transformer extending prior Segment Transformer with a Gated Fusion Layer combining content and structural streams via bi-directional cross-attention
  • Integration of Muffin Encoder into the Stage-1 embedding pipeline to capture high-frequency spectral artifacts across multi-band Mel-spectrograms
  • State-of-the-art AI-generated music detection results on SONICS and AIME datasets for full-length audio
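The paper's Gated Fusion Layer internals are not detailed in this summary. As an illustrative sketch only (single-head, weight-free scaled dot-product attention, hypothetical shapes and gate parameterization), bi-directional cross-attention between the content and structural streams followed by an adaptive sigmoid gate could look like:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv, d):
    # Queries from one stream attend over keys/values of the other.
    scores = q @ kv.T / np.sqrt(d)
    return softmax(scores) @ kv

def gated_fusion(content, structure, w_gate, b_gate):
    d = content.shape[-1]
    # Bi-directional: each stream attends to the other.
    c2s = cross_attention(content, structure, d)  # content -> structure
    s2c = cross_attention(structure, content, d)  # structure -> content
    # Adaptive gate weighs the two attended streams per position.
    pre = np.concatenate([c2s, s2c], axis=-1) @ w_gate + b_gate
    gate = 1.0 / (1.0 + np.exp(-pre))             # sigmoid
    return gate * c2s + (1.0 - gate) * s2c

rng = np.random.default_rng(0)
T, d = 8, 16                                  # 8 segments, 16-dim embeddings
content = rng.standard_normal((T, d))         # per-segment content stream
structure = rng.standard_normal((T, d))       # per-segment structural stream
w_gate = rng.standard_normal((2 * d, d)) * 0.1
b_gate = np.zeros(d)

fused = gated_fusion(content, structure, w_gate, b_gate)
print(fused.shape)  # (8, 16)
```

Because the gate is a convex combination per dimension, each fused position stays within the range spanned by the two attended streams; the actual layer presumably learns `w_gate`/`b_gate` end-to-end alongside the cross-attention projections.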

🛡️ Threat Analysis

Output Integrity Attack

The primary contribution is a novel AI-generated content detection architecture (Fusion Segment Transformer) that verifies the provenance and authenticity of music by distinguishing human-composed from AI-generated audio; this falls under output integrity and content-authenticity detection.


Details

Domains
audio
Model Types
transformer
Threat Tags
inference_time
Datasets
SONICS, AIME
Applications
AI-generated music detection, audio content authenticity