defense 2026

SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation

Wenli Zhang 1, Xianglong Shi 1, Sirui Zhao 1, Xinqi Chen 1, Guo Cheng 2, Yifan Xu 1, Tong Xu 1, Yong Liao 1


Published on arXiv: 2604.08405

Input Manipulation Attack

OWASP ML Top 10 — ML01

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Joint image+audio perturbations more effectively degrade lip synchronization and facial dynamics than single-modality baselines while preserving perceptual quality and remaining robust under purification

SyncBreaker

Novel technique introduced


Diffusion-based audio-driven talking-head generation enables realistic portrait animation, but also introduces risks of misuse, such as fraud and misinformation. Existing protection methods are largely limited to a single modality, and neither image-only nor audio-only attacks can effectively suppress speech-driven facial dynamics. To address this gap, we propose SyncBreaker, a stage-aware multimodal protection framework that jointly perturbs portrait and audio inputs under modality-specific perceptual constraints. Our key contributions are twofold. First, for the image stream, we introduce nullifying supervision with Multi-Interval Sampling (MIS) across diffusion stages to steer the generation toward the static reference portrait by aggregating guidance from multiple denoising intervals. Second, for the audio stream, we propose Cross-Attention Fooling (CAF), which suppresses interval-specific audio-conditioned cross-attention responses. Both streams are optimized independently and combined at inference time to enable flexible deployment. We evaluate SyncBreaker in a white-box proactive protection setting. Extensive experiments demonstrate that SyncBreaker more effectively degrades lip synchronization and facial dynamics than strong single-modality baselines, while preserving input perceptual quality and remaining robust under purification. Code: https://github.com/kitty384/SyncBreaker.
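The Multi-Interval Sampling (MIS) idea from the abstract — aggregating nullifying supervision from several denoising intervals so generation collapses toward the static reference portrait — can be illustrated with a toy sketch. The `denoise_step` stand-in, the interval choices, and all function names here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(x_t, t, portrait):
    # Stand-in for one diffusion denoising step conditioned on the portrait;
    # a real pipeline would call the diffusion model's noise predictor here.
    return x_t + 0.1 * (portrait - x_t) / (t + 1)

def mis_nullify_loss(perturbed_portrait, reference, intervals, steps_per_interval=4):
    """Aggregate the nullifying objective across multiple denoising intervals:
    the closer each interval's partial denoising lands to the static reference
    portrait, the lower the loss (animation is steered toward a still image)."""
    total = 0.0
    for t_start in intervals:
        x = rng.normal(size=reference.shape)      # noisy latent for this interval
        for k in range(steps_per_interval):
            x = denoise_step(x, t_start - k, perturbed_portrait)
        total += np.mean((x - reference) ** 2)    # distance to static portrait
    return total / len(intervals)

portrait = rng.normal(size=(8, 8))
loss = mis_nullify_loss(portrait, portrait, intervals=[50, 30, 10])
print(f"aggregated nullifying loss: {loss:.4f}")
```

In an actual attack, this aggregated loss would be differentiated with respect to the portrait perturbation; sampling several intervals rather than one is what makes the guidance stage-aware.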


Key Contributions

  • Multi-Interval Sampling (MIS) with nullifying supervision across diffusion stages to steer generation toward static reference portrait
  • Cross-Attention Fooling (CAF) to suppress audio-conditioned cross-attention responses
  • Modality-independent optimization enabling flexible deployment of image-only, audio-only, or joint perturbations
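The Cross-Attention Fooling (CAF) contribution above can be sketched as follows: perturb the audio features so that visual queries stop attending sharply to speech tokens. The shapes, the peakiness-based energy, and the greedy random search below are assumptions for illustration; the paper optimizes interval-specific cross-attention responses inside the diffusion generator.

```python
import numpy as np

rng = np.random.default_rng(1)

def cross_attention_energy(q, k_audio):
    """Peakiness of the attention that visual queries place on audio keys."""
    d = q.shape[-1]
    scores = q @ k_audio.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return float(attn.max(axis=-1).mean())

q = rng.normal(size=(16, 32))    # visual queries
k = rng.normal(size=(10, 32))    # audio keys (one per speech frame)

# Greedy search: keep any audio perturbation that lowers the attention
# energy while staying inside a small L_inf budget eps.
eps, delta = 0.05, np.zeros_like(k)
best = cross_attention_energy(q, k)
for _ in range(200):
    cand = np.clip(delta + 0.01 * rng.normal(size=k.shape), -eps, eps)
    e = cross_attention_energy(q, k + cand)
    if e < best:
        best, delta = e, cand
print(f"attention energy before/after: {cross_attention_energy(q, k):.4f} -> {best:.4f}")
```

Flattening the attention distribution (lower peakiness) is one concrete way to "suppress audio-conditioned cross-attention responses": the generator no longer receives a confident speech signal to drive lip motion.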

🛡️ Threat Analysis

Input Manipulation Attack

The paper proposes adversarial perturbations applied to both image and audio inputs at inference time to induce generation failure (disrupted lip sync and suppressed facial dynamics). This is a proactive defense: adversarial examples are used to protect content against misuse by talking-head generation models.
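Since the two streams are optimized independently, the joint attack can be pictured as two PGD-style loops, each with its own modality-specific budget. Everything below (the toy objective, step sizes, and budgets) is an assumed sketch, not the paper's procedure; a real run would backpropagate through the talking-head generator instead of using finite differences.

```python
import numpy as np

rng = np.random.default_rng(2)

def attack_loss(img, aud):
    # Toy stand-in objective; the real framework scores lip-sync and
    # facial-dynamics degradation of the generated video.
    return -float(np.sum(img ** 2) + np.sum(aud ** 2))

def numeric_grad(f, x, h=1e-4):
    """Central-difference gradient, used here in place of autograd."""
    g = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"])
    for _ in it:
        i = it.multi_index
        xp = x.copy(); xp[i] += h
        xm = x.copy(); xm[i] -= h
        g[i] = (f(xp) - f(xm)) / (2 * h)
    return g

image = rng.normal(size=(4, 4))
audio = rng.normal(size=(6,))
eps_img, eps_aud = 8 / 255, 0.02          # per-modality perceptual budgets
d_img = np.zeros_like(image)
d_aud = np.zeros_like(audio)

for _ in range(5):                         # PGD: sign step + projection
    g_img = numeric_grad(lambda z: attack_loss(z, audio + d_aud), image + d_img)
    g_aud = numeric_grad(lambda z: attack_loss(image + d_img, z), audio + d_aud)
    d_img = np.clip(d_img - 0.01 * np.sign(g_img), -eps_img, eps_img)
    d_aud = np.clip(d_aud - 0.005 * np.sign(g_aud), -eps_aud, eps_aud)

print(f"image budget respected: {np.abs(d_img).max() <= eps_img}")
```

Keeping the image and audio loops decoupled mirrors the paper's deployment flexibility: either perturbation can be applied alone, or both combined at inference time.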

Output Integrity Attack

The paper's goal is to protect portrait and audio content from unauthorized deepfake synthesis by embedding imperceptible perturbations that disrupt generation. This is content protection/integrity — preventing AI models from misusing protected content to generate realistic deepfakes.


Details

Domains
multimodal, audio, vision, generative
Model Types
diffusion, multimodal
Threat Tags
white_box, inference_time, digital
Datasets
CelebA-HQ, LibriSpeech, HDTF
Applications
deepfake protection, talking-head generation, portrait animation