SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation
Wenli Zhang 1, Xianglong Shi 1, Sirui Zhao 1, Xinqi Chen 1, Guo Cheng 2, Yifan Xu 1, Tong Xu 1, Yong Liao 1
Published on arXiv
2604.08405
Input Manipulation Attack
OWASP ML Top 10 — ML01
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Joint image+audio perturbations degrade lip synchronization and facial dynamics more effectively than single-modality baselines, while preserving perceptual quality and remaining robust under purification
SyncBreaker
Novel technique introduced
Diffusion-based audio-driven talking-head generation enables realistic portrait animation, but also introduces risks of misuse, such as fraud and misinformation. Existing protection methods are largely limited to a single modality, and neither image-only nor audio-only attacks can effectively suppress speech-driven facial dynamics. To address this gap, we propose SyncBreaker, a stage-aware multimodal protection framework that jointly perturbs portrait and audio inputs under modality-specific perceptual constraints. Our key contributions are twofold. First, for the image stream, we introduce nullifying supervision with Multi-Interval Sampling (MIS) across diffusion stages to steer the generation toward the static reference portrait by aggregating guidance from multiple denoising intervals. Second, for the audio stream, we propose Cross-Attention Fooling (CAF), which suppresses interval-specific audio-conditioned cross-attention responses. Both streams are optimized independently and combined at inference time to enable flexible deployment. We evaluate SyncBreaker in a white-box proactive protection setting. Extensive experiments demonstrate that SyncBreaker more effectively degrades lip synchronization and facial dynamics than strong single-modality baselines, while preserving input perceptual quality and remaining robust under purification. Code: https://github.com/kitty384/SyncBreaker.
Key Contributions
- Multi-Interval Sampling (MIS) with nullifying supervision across diffusion stages to steer generation toward static reference portrait
- Cross-Attention Fooling (CAF) to suppress audio-conditioned cross-attention responses
- Modality-independent optimization enabling flexible deployment of image-only, audio-only, or joint perturbations
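To make the two contributions concrete, here is a minimal NumPy sketch of what the MIS and CAF objectives could look like. The function names, tensor shapes, and exact loss forms are assumptions for illustration, not the paper's implementation: MIS is sketched as an average nullifying loss over predictions from several denoising intervals, and CAF as the magnitude of the audio-conditioned cross-attention response.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mis_loss(interval_preds, reference):
    """Nullifying supervision (hypothetical form): average MSE between
    clean-frame predictions sampled from multiple denoising intervals
    and the static reference portrait. The image perturbation is
    optimized to minimize this, steering generation toward stillness."""
    return float(np.mean([np.mean((p - reference) ** 2) for p in interval_preds]))

def caf_loss(queries, keys, values):
    """Cross-Attention Fooling (hypothetical form): norm of the
    audio-conditioned cross-attention output. The audio perturbation
    is optimized to suppress this response within chosen intervals."""
    d = queries.shape[-1]
    attn = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return float(np.linalg.norm(attn @ values))
```

Because the two losses touch disjoint inputs (the portrait for MIS, the audio features for CAF), each stream can be optimized independently, which is what enables the flexible deployment described above.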
🛡️ Threat Analysis
The paper proposes adversarial perturbations applied to both the image and audio inputs at inference time so that generation fails (lip sync and facial dynamics are disrupted). This is a proactive defense that uses adversarial examples to protect content against misuse by talking-head generation models.
The paper's goal is to protect portrait and audio content from unauthorized deepfake synthesis by embedding imperceptible perturbations that disrupt generation. This is content protection/integrity — preventing AI models from misusing protected content to generate realistic deepfakes.
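The deployment side of this protection scheme is simple to sketch: independently optimized perturbations are clipped to modality-specific perceptual budgets and added to the inputs before they reach the generation model. The budget values and function names below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def protect(image, audio, delta_img, delta_aud,
            eps_img=8 / 255, eps_aud=0.002):
    """Combine independently optimized perturbations at inference.
    eps_img / eps_aud are hypothetical perceptual budgets; either
    delta can be zero for image-only or audio-only protection."""
    # clip each perturbation to its modality-specific budget
    img_protected = np.clip(image + np.clip(delta_img, -eps_img, eps_img), 0.0, 1.0)
    aud_protected = audio + np.clip(delta_aud, -eps_aud, eps_aud)
    return img_protected, aud_protected
```

Passing a zero perturbation for one modality recovers the single-modality setting, which is how the framework supports image-only, audio-only, or joint protection from the same pipeline.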