Defense · 2026

HierCon: Hierarchical Contrastive Attention for Audio Deepfake Detection

Zhili Nicholas Liang, Soyeon Caren Han, Qizhou Wang, Christopher Leckie

0 citations · 14 references · arXiv (Cornell University)

Published on arXiv · 2602.01032

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves 1.93% and 6.87% EER on ASVspoof 2021 DF and In-the-Wild datasets, outperforming independent layer-weighting baselines by 36.6% and 22.5% respectively.
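The headline numbers are Equal Error Rates (EER): the operating point at which the false-accept rate on spoofed audio equals the false-reject rate on bona fide audio. A minimal numpy sketch of the metric (function name and score convention are illustrative, not from the paper):

```python
import numpy as np

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """Equal Error Rate: the threshold where false-accept rate == false-reject rate.
    labels: 1 = bona fide, 0 = spoof; scores: higher means more likely bona fide."""
    thresholds = np.sort(np.unique(scores))
    bona = scores[labels == 1]
    spoof = scores[labels == 0]
    frr = np.array([(bona < t).mean() for t in thresholds])    # bona fide rejected
    far = np.array([(spoof >= t).mean() for t in thresholds])  # spoof accepted
    idx = np.argmin(np.abs(frr - far))                         # closest crossing point
    return float((frr[idx] + far[idx]) / 2)
```

A perfectly separating detector scores EER 0%; a 1.93% EER means both error rates can be driven down to about 2% at the same threshold.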

HierCon

Novel technique introduced


Audio deepfakes generated by modern TTS and voice conversion systems are increasingly difficult to distinguish from real speech, raising serious risks for security and online trust. While state-of-the-art self-supervised models provide rich multi-layer representations, existing detectors treat layers independently and overlook temporal and hierarchical dependencies critical for identifying synthetic artefacts. We propose HierCon, a hierarchical layer attention framework combined with margin-based contrastive learning that models dependencies across temporal frames, neighbouring layers, and layer groups, while encouraging domain-invariant embeddings. Evaluated on ASVspoof 2021 DF and In-the-Wild datasets, our method achieves state-of-the-art performance (1.93% and 6.87% EER), improving over independent layer weighting by 36.6% and 22.5% respectively. The results and attention visualisations confirm that hierarchical modelling enhances generalisation to cross-domain generation techniques and recording conditions.
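The abstract's "margin-based contrastive learning" can be illustrated with a classic pairwise margin loss (Hadsell-style), which pulls same-class embeddings together and pushes different-class pairs apart to at least a margin. This is a generic stand-in for the paper's regulariser, whose exact formulation is not given here; all names and the default margin are illustrative:

```python
import numpy as np

def contrastive_margin_loss(z1: np.ndarray, z2: np.ndarray,
                            same_class: bool, margin: float = 1.0) -> float:
    """Pairwise margin contrastive loss: squared distance for same-class pairs,
    squared hinge on (margin - distance) for different-class pairs."""
    d = float(np.linalg.norm(z1 - z2))
    if same_class:
        return d ** 2                       # pull matching pairs together
    return max(0.0, margin - d) ** 2        # push mismatched pairs beyond margin
```

Applied across domains, a loss of this shape encourages bona fide (or spoofed) embeddings from different generation techniques and recording conditions to cluster together, which is the domain-invariance the abstract refers to.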


Key Contributions

  • Hierarchical three-stage attention (temporal, intra-group, inter-group) over SSL transformer layers to capture complementary cues across the representational hierarchy
  • Margin-based contrastive regularisation that enforces domain-invariant embedding geometry, improving cross-domain generalisation
  • State-of-the-art EER of 1.93% (ASVspoof 2021 DF) and 6.87% (In-the-Wild), with 36.6% and 22.5% relative improvement over independent layer weighting
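The three-stage hierarchy above can be sketched as attention pooling applied at successive levels of the SSL transformer's layer stack: frames within a layer, layers within a group, then groups. The sketch below uses random vectors as stand-ins for learned attention parameters, and the grouping and function names are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn_pool(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Attention pooling: score each row of X against w, softmax, weighted sum.
    X: (n, d), w: (d,) -> (d,)"""
    a = softmax(X @ w)
    return a @ X

def hiercon_embed(layers: np.ndarray, groups: list, rng) -> np.ndarray:
    """layers: (L, T, d) hidden states from L SSL layers over T frames;
    groups: lists of neighbouring layer indices.
    Random weights stand in for learned attention parameters."""
    L, T, d = layers.shape
    w_t, w_l, w_g = rng.normal(size=(3, d))
    # Stage 1: temporal attention within each layer -> one vector per layer
    layer_vecs = np.stack([attn_pool(layers[l], w_t) for l in range(L)])
    # Stage 2: intra-group attention over neighbouring layers -> group vectors
    group_vecs = np.stack([attn_pool(layer_vecs[g], w_l) for g in groups])
    # Stage 3: inter-group attention -> utterance-level embedding
    return attn_pool(group_vecs, w_g)
```

Each stage reduces one axis of the (layers × frames × features) tensor while letting attention decide which frames, layers, and groups carry the spoofing cues, rather than weighting layers independently.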

🛡️ Threat Analysis

Output Integrity Attack

Proposes a novel architecture for detecting AI-generated audio content. Detecting TTS/voice-conversion deepfakes falls under output integrity/authenticity (ML09); the contribution is a new forensic detection technique, not merely an existing detector applied to a new domain.


Details

Domains
audio
Model Types
transformer
Threat Tags
inference_time · digital
Datasets
ASVspoof 2021 DF · In-the-Wild
Applications
audio deepfake detection · anti-spoofing · voice authentication