Defense · 2026

Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs

Jun Xue 1, Yi Chai 1, Yanzhen Ren 1, Jinshen He 2, Zhiqiang Tang 3, Zhuolin Yi 1, Yihuan Huang 1, Yuankun Xie 4, Yujie Chen 5

1 citation · 32 references · arXiv

Published on arXiv: 2601.21463

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

PELM achieves a detection EER of 0.57% on HumanEdit and a localization EER of 9.28% on AiEdit, significantly outperforming state-of-the-art detection methods on both datasets.
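The equal error rate (EER) quoted above is the standard metric here: the operating point where the false-accept rate (edited clips passing as genuine) equals the false-reject rate (genuine clips flagged as edited). A minimal sketch of how it can be estimated from raw detector scores (a simple threshold sweep, not the paper's evaluation code):

```python
def equal_error_rate(genuine_scores, edited_scores):
    """Estimate the EER by sweeping thresholds over all observed scores.
    Convention assumed: higher score = more likely genuine.
    Returns the rate at the threshold where |FAR - FRR| is smallest."""
    thresholds = sorted(set(genuine_scores) | set(edited_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        # False accepts: edited clips scored at/above the threshold.
        far = sum(s >= t for s in edited_scores) / len(edited_scores)
        # False rejects: genuine clips scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer
```

With perfectly separated scores this returns 0.0; production evaluations typically interpolate on the DET curve rather than sweeping discrete thresholds.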

PELM (Prior-Enhanced Audio Large Language Model)

Novel technique introduced


Speech editing achieves semantic inversion by performing fine-grained segment-level manipulation on original utterances, while preserving global perceptual naturalness. Existing detection studies mainly focus on manually edited speech with explicit splicing artifacts, and therefore struggle to cope with emerging end-to-end neural speech editing techniques that generate seamless acoustic transitions. To address this challenge, we first construct a large-scale bilingual dataset, AiEdit, which leverages large language models to drive precise semantic tampering logic and employs multiple advanced neural speech editing methods for data synthesis, thereby filling the gap of high-quality speech editing datasets. Building upon this foundation, we propose PELM (Prior-Enhanced Audio Large Language Model), the first large-model framework that unifies speech editing detection and content localization by formulating them as an audio question answering task. To mitigate the inherent forgery bias and semantic-priority bias observed in existing audio large models, PELM incorporates word-level probability priors to provide explicit acoustic cues, and further designs a centroid-aggregation-based acoustic consistency perception loss to explicitly enforce the modeling of subtle local distribution anomalies. Extensive experimental results demonstrate that PELM significantly outperforms state-of-the-art methods on both the HumanEdit and AiEdit datasets, achieving equal error rates (EER) of 0.57% and 9.28% (localization), respectively.
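The abstract's "word-level probability priors" fuse explicit acoustic cues into the audio LLM's input so the model cannot rely on semantics alone. A plausible minimal sketch of that idea, assuming a front-end model that emits per-word genuineness confidences (the `build_prior_prompt` helper and the `[p=…]` tag format are illustrative inventions, not the paper's actual interface):

```python
def build_prior_prompt(words, confidences, question):
    """Sketch: annotate each transcript word with its acoustic
    genuineness prior before handing the prompt to an audio LLM.
    `words` and `confidences` would come from an ASR/anti-spoofing
    front end; low values mark acoustically suspicious words."""
    tagged = " ".join(
        f"{w}[p={c:.2f}]" for w, c in zip(words, confidences)
    )
    return (
        "Transcript with word-level genuineness priors:\n"
        f"{tagged}\n"
        f"Question: {question}"
    )
```

The point of the design is that detection and localization both become answerable from the same annotated prompt, matching the paper's unified question-answering formulation.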


Key Contributions

  • AiEdit: a large-scale bilingual dataset for high-quality neural speech editing detection, synthesized via LLM-driven semantic tampering logic and multiple neural speech editing methods
  • PELM: the first audio LLM framework that jointly performs speech editing detection and content localization as an audio question-answering task
  • Word-level probability priors and centroid-aggregation-based acoustic consistency perception loss to mitigate forgery bias and semantic-priority bias in audio LLMs
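The centroid-aggregation-based consistency loss in the last contribution can be read as a clustering objective over frame embeddings: genuine frames are pulled toward their centroid while edited frames are pushed away, making subtle local distribution anomalies separable. A minimal pure-Python sketch of that reading (our interpretation with an assumed margin term, not the paper's exact formulation):

```python
import math

def consistency_loss(frame_embs, edit_mask, margin=1.0):
    """Sketch of a centroid-aggregation acoustic-consistency loss.
    frame_embs: list of frame embedding tuples; edit_mask: True where
    the frame falls inside an edited segment. Genuine frames are pulled
    toward the genuine centroid; edited frames are pushed at least
    `margin` away from it (hinge penalty)."""
    genuine = [e for e, m in zip(frame_embs, edit_mask) if not m]
    edited = [e for e, m in zip(frame_embs, edit_mask) if m]
    dim = len(frame_embs[0])
    # Centroid of the genuine (unedited) frames only.
    centroid = [sum(e[d] for e in genuine) / len(genuine) for d in range(dim)]

    def dist(e):
        return math.sqrt(sum((e[d] - centroid[d]) ** 2 for d in range(dim)))

    pull = sum(dist(e) ** 2 for e in genuine) / len(genuine)
    push = sum(max(0.0, margin - dist(e)) ** 2 for e in edited) / max(len(edited), 1)
    return pull + push
```

In a real training loop this would be a differentiable term (e.g. in PyTorch) added to the LLM's answer loss; the sketch only shows the geometry of the objective.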

🛡️ Threat Analysis

Output Integrity Attack

Proposes a novel method to detect AI-generated/manipulated speech content (neural speech editing deepfakes) and localize the tampered segments — directly addresses output integrity and authenticity of audio content. Includes a new dataset (AiEdit) and novel architectural components (word-level probability priors, centroid-aggregation consistency loss) rather than merely applying existing methods to a domain.


Details

Domains
audio, nlp
Model Types
llm, transformer
Threat Tags
inference_time, digital
Datasets
HumanEdit, AiEdit
Applications
speech editing detection, audio deepfake detection, audio content localization