Interpretable All-Type Audio Deepfake Detection with Audio LLMs via Frequency-Time Reinforcement Learning
Yuankun Xie 1,2, Xiaoxuan Guo 1,2, Jiayi Zhou 2, Tao Wang 2, Jian Liu 2, Ruibo Fu 3, Xiaopeng Wang 3, Haonan Cheng 1, Long Ye 1
Published on arXiv
2601.02983
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
FT-GRPO achieves 99.75% accuracy on ASVspoof2019LA and 90.10% average accuracy across all audio types while producing interpretable frequency-time grounded rationales.
FT-GRPO (Frequency Time-Group Relative Policy Optimization)
Novel technique introduced
Recent advances in audio large language models (ALLMs) have made high-quality synthetic audio widely accessible, increasing the risk of malicious audio deepfakes across speech, environmental sounds, singing voice, and music. Real-world audio deepfake detection (ADD) therefore requires all-type detectors that generalize across heterogeneous audio and provide interpretable decisions. Given the strong multi-task generalization ability of ALLMs, we first investigate their performance on all-type ADD under both supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). However, SFT using only binary real/fake labels tends to reduce the model to a black-box classifier, sacrificing interpretability. Meanwhile, vanilla RFT under sparse supervision is prone to reward hacking and can produce hallucinated, ungrounded rationales. To address this, we propose an automatic annotation-and-polishing pipeline that constructs frequency-time structured chain-of-thought (CoT) rationales, producing ~340K cold-start demonstrations. Building on this CoT data, we propose Frequency Time-Group Relative Policy Optimization (FT-GRPO), a two-stage training paradigm that cold-starts ALLMs with SFT and then applies GRPO under rule-based frequency-time constraints. Experiments demonstrate that FT-GRPO achieves state-of-the-art performance on all-type ADD while producing interpretable, FT-grounded rationales. The data and code are available online.
Key Contributions
- Automatic annotation-and-polishing pipeline that constructs ~340K frequency-time structured chain-of-thought (CoT) rationales for audio deepfake detection datasets
- FT-GRPO: a two-stage training paradigm combining SFT cold start with GRPO under rule-based frequency-time domain constraints to improve interpretability and generalization
- State-of-the-art all-type audio deepfake detection achieving 99.75% accuracy on ASVspoof2019LA and 90.10% average accuracy across all audio types
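To make the second contribution concrete, the sketch below illustrates the general shape of GRPO's group-relative advantage computation combined with a rule-based reward that checks both label correctness and frequency-time grounding of the rationale. The specific reward terms, weights, and regex-based FT checks here are illustrative assumptions, not the paper's published reward specification.

```python
import re
import statistics

def ft_rule_reward(completion: str, gold_label: str) -> float:
    """Score one sampled rationale: verdict correctness plus FT grounding.

    The grounding check (requiring both a frequency band and a time span)
    is a hypothetical stand-in for the paper's rule-based FT constraints.
    """
    reward = 0.0
    text = completion.lower()
    # Correctness term: the stated verdict must match the gold real/fake label.
    verdict = re.search(r"\b(real|fake)\b", text)
    if verdict and verdict.group(1) == gold_label:
        reward += 1.0
    # Grounding term: rationale must cite a frequency and a time location,
    # discouraging ungrounded or hallucinated explanations.
    has_freq = re.search(r"\d+(\.\d+)?\s*k?hz", text) is not None
    has_time = re.search(r"\d+(\.\d+)?\s*(s|sec|ms)\b", text) is not None
    if has_freq and has_time:
        reward += 0.5
    return reward

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO normalizes rewards within each group of sampled completions,
    using the group mean/std as a baseline instead of a learned critic."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Toy group of sampled rationales for one fake audio clip.
group = [
    "Fake: spectral smearing around 4 kHz between 1.2 s and 1.8 s.",
    "Fake.",
    "Real: harmonics look natural.",
]
rewards = [ft_rule_reward(c, "fake") for c in group]
advantages = group_relative_advantages(rewards)
```

Under this toy reward, the FT-grounded correct rationale scores highest (1.5), the correct but ungrounded one scores 1.0, and the wrong verdict scores 0, so the group-relative advantages push the policy toward grounded explanations rather than bare labels.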
🛡️ Threat Analysis
The primary contribution is a novel detection approach for AI-generated audio content (deepfake speech, singing voice, environmental sounds, and music), squarely within AI-generated content detection under Output Integrity. The paper introduces FT-GRPO, a new training paradigm, rather than a domain application of existing methods.