defense 2025

NE-PADD: Leveraging Named Entity Knowledge for Robust Partial Audio Deepfake Detection via Attention Aggregation

Huhong Xian 1, Rui Liu 1, Berrak Sisman 2, Haizhou Li 3

0 citations

Published on arXiv

2509.03829

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

NE-PADD outperforms all advanced baselines on the PartialSpoof-NER dataset, demonstrating that integrating named entity knowledge improves frame-level synthetic speech detection.

NE-PADD

Novel technique introduced


Unlike traditional sentence-level audio deepfake detection (ADD), partial audio deepfake detection (PADD) requires frame-level localization of fake speech. While some progress has been made in this area, leveraging semantic information from audio, especially named entities, remains underexplored. To this end, we propose NE-PADD, a novel PADD method that leverages named entity knowledge through two parallel branches: Speech Named Entity Recognition (SpeechNER) and PADD. The approach incorporates two attention aggregation mechanisms: Attention Fusion (AF), which combines attention weights, and Attention Transfer (AT), which guides PADD with named entity semantics through an auxiliary loss. Experiments on the PartialSpoof-NER dataset show that our method outperforms existing baselines, demonstrating the effectiveness of integrating named entity knowledge in PADD. The code is available at https://github.com/AI-S2-Lab/NE-PADD.
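
To make the frame-level task concrete, here is a minimal sketch of how spoofed time spans could be turned into per-frame targets for a PADD model. The function name `frame_labels`, the 20 ms frame size, and the millisecond inputs are illustrative assumptions, not details taken from the paper or its repository.

```python
import numpy as np

def frame_labels(duration_ms, fake_segments_ms, frame_ms=20):
    """Return one 0/1 label per frame; 1 marks a frame that overlaps a fake span.

    All names and the 20 ms default are hypothetical, chosen for illustration.
    Integer milliseconds are used to avoid floating-point rounding at frame edges.
    """
    n_frames = -(-duration_ms // frame_ms)           # ceiling division
    labels = np.zeros(n_frames, dtype=np.int64)
    for start_ms, end_ms in fake_segments_ms:
        first = start_ms // frame_ms                 # first frame touched by the span
        last = -(-end_ms // frame_ms)                # one past the last frame touched
        labels[first:min(last, n_frames)] = 1
    return labels

# A 200 ms clip whose 60-100 ms region was replaced with synthetic speech.
labels = frame_labels(duration_ms=200, fake_segments_ms=[(60, 100)])
```

A sentence-level detector would collapse this vector to a single label, whereas a PADD model is trained and scored against every entry.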


Key Contributions

  • NE-PADD framework with two parallel branches (SpeechNER + PADD) that integrates named entity knowledge for partial audio deepfake detection
  • Attention Fusion (AF) mechanism combining attention weights from SpeechNER and PADD branches
  • Attention Transfer (AT) mechanism using an auxiliary loss to guide PADD with named entity semantic information
  • The PartialSpoof-NER dataset, on which the method is built and evaluated
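
The two attention aggregation mechanisms listed above can be sketched as follows. This is one plausible reading under stated assumptions, not the authors' implementation: the fusion rule (a weighted average), the transfer distance (mean squared error), the weighting `lam`, and all function names are hypothetical, and the summary does not say which formulas the paper actually uses.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax, used here to build toy attention maps."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_fusion(attn_padd, attn_ner, weight=0.5):
    """Assumed AF rule: convex combination of the two branches' attention weights."""
    return weight * attn_padd + (1.0 - weight) * attn_ner

def attention_transfer_loss(attn_padd, attn_ner):
    """Assumed AT auxiliary loss: mean squared difference between attention maps."""
    return float(np.mean((attn_padd - attn_ner) ** 2))

# Toy attention maps over T frames for the PADD and SpeechNER branches.
rng = np.random.default_rng(0)
T = 6
attn_padd = softmax(rng.normal(size=(T, T)))
attn_ner = softmax(rng.normal(size=(T, T)))

fused = attention_fusion(attn_padd, attn_ner)          # AF: fused attention weights
aux = attention_transfer_loss(attn_padd, attn_ner)     # AT: auxiliary guidance term

detection_loss = 0.7   # placeholder value standing in for the frame-level PADD loss
lam = 0.1              # hypothetical weight on the auxiliary term
total_loss = detection_loss + lam * aux
```

In this sketch AF changes what the PADD branch attends to at inference time, while AT only shapes training through the extra loss term; how the paper balances or combines the two is not stated in this summary.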

🛡️ Threat Analysis

Output Integrity Attack

Proposes a novel architecture for detecting AI-generated (synthetic) speech at the frame level within an audio clip. This falls under output integrity: verifying content authenticity by detecting AI-generated audio deepfakes.


Details

Domains
audio
Model Types
transformer
Threat Tags
digital, inference_time
Datasets
PartialSpoof, PartialSpoof-NER
Applications
audio deepfake detection, partial speech forgery localization