SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation
Wenli Zhang 1, Xianglong Shi 1, Sirui Zhao 1, Xinqi Chen 1, Guo Cheng 2, Yifan Xu 1, Tong Xu 1, Yong Liao 1
Published on arXiv
2604.08405
Input Manipulation Attack
OWASP ML Top 10 — ML01
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Joint image+audio perturbations degrade lip synchronization and facial dynamics more effectively than single-modality baselines, while preserving perceptual quality and remaining robust under purification
SyncBreaker
Novel technique introduced
Diffusion-based audio-driven talking-head generation enables realistic portrait animation, but also introduces risks of misuse, such as fraud and misinformation. Existing protection methods are largely limited to a single modality, and neither image-only nor audio-only attacks can effectively suppress speech-driven facial dynamics. To address this gap, we propose SyncBreaker, a stage-aware multimodal protection framework that jointly perturbs portrait and audio inputs under modality-specific perceptual constraints. Our key contributions are twofold. First, for the image stream, we introduce nullifying supervision with Multi-Interval Sampling (MIS) across diffusion stages to steer the generation toward the static reference portrait by aggregating guidance from multiple denoising intervals. Second, for the audio stream, we propose Cross-Attention Fooling (CAF), which suppresses interval-specific audio-conditioned cross-attention responses. Both streams are optimized independently and combined at inference time to enable flexible deployment. We evaluate SyncBreaker in a white-box proactive protection setting. Extensive experiments demonstrate that SyncBreaker more effectively degrades lip synchronization and facial dynamics than strong single-modality baselines, while preserving input perceptual quality and remaining robust under purification. Code: https://github.com/kitty384/SyncBreaker.
Key Contributions
- Multi-Interval Sampling (MIS) with nullifying supervision across diffusion stages to steer generation toward static reference portrait
- Cross-Attention Fooling (CAF) to suppress audio-conditioned cross-attention responses
- Modality-independent optimization enabling flexible deployment of image-only, audio-only, or joint perturbations
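To make the two contributions concrete, here is a minimal NumPy sketch of what the MIS and CAF objectives could look like. The function names, tensor shapes, and exact loss forms are assumptions for illustration, not the paper's implementation: MIS is sketched as an average nullifying loss over predictions from several denoising intervals, and CAF as the magnitude of the audio-conditioned cross-attention response.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mis_loss(interval_preds, reference):
    """Nullifying supervision (hypothetical form): average MSE between
    clean-frame predictions sampled from multiple denoising intervals
    and the static reference portrait. The image perturbation is
    optimized to minimize this, steering generation toward stillness."""
    return float(np.mean([np.mean((p - reference) ** 2) for p in interval_preds]))

def caf_loss(queries, keys, values):
    """Cross-Attention Fooling (hypothetical form): norm of the
    audio-conditioned cross-attention output. The audio perturbation
    is optimized to suppress this response within chosen intervals."""
    d = queries.shape[-1]
    attn = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return float(np.linalg.norm(attn @ values))
```

Because the two losses touch disjoint inputs (the portrait for MIS, the audio features for CAF), each stream can be optimized independently, which is what enables the flexible deployment described above.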
🛡️ Threat Analysis
The paper proposes adversarial perturbations applied to both the image and audio inputs at inference time so that generation fails (lip sync and facial dynamics are disrupted). This is a proactive defense that uses adversarial examples to protect content against misuse by talking-head generation models.
The paper's goal is to protect portrait and audio content from unauthorized deepfake synthesis by embedding imperceptible perturbations that disrupt generation. This is content protection/integrity — preventing AI models from misusing protected content to generate realistic deepfakes.
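The deployment side of this protection scheme is simple to sketch: independently optimized perturbations are clipped to modality-specific perceptual budgets and added to the inputs before they reach the generation model. The budget values and function names below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def protect(image, audio, delta_img, delta_aud,
            eps_img=8 / 255, eps_aud=0.002):
    """Combine independently optimized perturbations at inference.
    eps_img / eps_aud are hypothetical perceptual budgets; either
    delta can be zero for image-only or audio-only protection."""
    # clip each perturbation to its modality-specific budget
    img_protected = np.clip(image + np.clip(delta_img, -eps_img, eps_img), 0.0, 1.0)
    aud_protected = audio + np.clip(delta_aud, -eps_aud, eps_aud)
    return img_protected, aud_protected
```

Passing a zero perturbation for one modality recovers the single-modality setting, which is how the framework supports image-only, audio-only, or joint protection from the same pipeline.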