defense arXiv Apr 18, 2026
Hao Chen, Junnan Xu
Detects lip-sync deepfakes via biomechanical constraint violations in perioral motion, achieving cross-lingual generalization without audio or pixels
Output Integrity Attack vision multimodal
Current lip-sync deepfake detectors rely on pixel-level artifacts or audio-visual correspondence, and they fail to generalize across languages because these cues encode data-dependent patterns rather than universal physical laws. We identify a more fundamental principle: generative models do not enforce the biomechanical constraints of authentic orofacial articulation, producing measurably elevated temporal lip variance -- a signal we term temporal lip jitter -- that is empirically consistent across speaker language, ethnicity, and recording conditions. We instantiate this principle through BioLip, a lightweight framework operating on 64 perioral landmark coordinates extracted by MediaPipe.
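To make the idea of temporal lip jitter concrete, here is a minimal sketch of one way such a score could be computed from landmark trajectories. It is not BioLip's actual definition, which the abstract does not give. The function name `temporal_lip_jitter`, the input layout (a list of frames, each a list of `(x, y)` tuples), and the choice of "variance of frame-to-frame displacement, averaged over landmarks" are all assumptions made for illustration. Landmark extraction is left out; the input is assumed to already hold the perioral coordinates.

```python
from math import hypot
from statistics import pvariance


def temporal_lip_jitter(frames):
    """Illustrative jitter score (an assumption, not the paper's formula).

    frames: list of frames; each frame is a list of (x, y) landmark tuples,
            with the same landmark order in every frame.
    Returns the variance over time of each landmark's frame-to-frame
    displacement magnitude, averaged across landmarks.
    """
    if len(frames) < 3:
        raise ValueError("need at least 3 frames to measure variation in motion")
    n_landmarks = len(frames[0])
    if n_landmarks == 0 or any(len(f) != n_landmarks for f in frames):
        raise ValueError("every frame must contain the same non-zero number of landmarks")

    per_landmark_variance = []
    for i in range(n_landmarks):
        # Displacement magnitude of landmark i between consecutive frames.
        steps = [
            hypot(nxt[i][0] - cur[i][0], nxt[i][1] - cur[i][1])
            for cur, nxt in zip(frames, frames[1:])
        ]
        # Smooth motion gives near-constant steps, hence low variance.
        per_landmark_variance.append(pvariance(steps))

    return sum(per_landmark_variance) / n_landmarks


# Toy trajectories with two landmarks: steady drift vs. the same drift with
# an alternating offset that mimics frame-to-frame instability.
smooth = [[(0.1 * t, 0.0), (0.1 * t, 1.0)] for t in range(10)]
jittery = [
    [(0.1 * t + (0.05 if t % 2 else 0.0), 0.0),
     (0.1 * t + (0.05 if t % 2 else 0.0), 1.0)]
    for t in range(10)
]

print(temporal_lip_jitter(smooth) < temporal_lip_jitter(jittery))
```

Under this reading, a detector would threshold or classify on the score; where to set that boundary, and whether coordinates should first be normalized for face size and head pose, are open choices that this sketch does not address.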
gan diffusion generative