Modeling Biomechanical Constraint Violations for Language-Agnostic Lip-Sync Deepfake Detection
Published on arXiv: 2604.16808
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Achieves AUC 0.905 on English, 0.779 on Mandarin Chinese, 0.969 on multi-ethnic FakeAVCeleb (σ = 0.009 across five groups), and 0.843 on seven-language PolyGlotFake in zero-shot transfer, using only 107,777 parameters.
BioLip
Novel technique introduced
Current lip-sync deepfake detectors rely on pixel-level artifacts or audio-visual correspondence. They fail to generalize across languages because these cues encode data-dependent patterns rather than universal physical laws. We identify a more fundamental principle: generative models do not enforce the biomechanical constraints of authentic orofacial articulation, so they produce measurably elevated temporal lip variance, a signal we term temporal lip jitter, that is empirically consistent across speaker language, ethnicity, and recording conditions. We instantiate this principle through BioLip, a lightweight framework operating on 64 perioral landmark coordinates extracted by MediaPipe.
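To make the idea of temporal lip jitter concrete, the sketch below scores a clip by the variance of frame-to-frame landmark displacement, averaged over landmarks. This is an illustrative assumption, not the paper's exact definition: the function name `temporal_lip_jitter`, the per-frame list of `(x, y)` tuples, and the synthetic clips are all hypothetical, and landmark extraction (for example with MediaPipe) is assumed to have happened upstream.

```python
import math
import random
from statistics import pvariance


def temporal_lip_jitter(frames):
    """Mean variance of frame-to-frame landmark displacement.

    frames: list of frames, each a list of (x, y) landmark tuples.
    NOTE: this jitter definition is an assumption for illustration,
    not the formula used by BioLip.
    """
    if len(frames) < 3:
        raise ValueError("need at least 3 frames")
    n_landmarks = len(frames[0])
    per_landmark = []
    for k in range(n_landmarks):
        # Euclidean displacement of landmark k between consecutive frames
        steps = [
            math.dist(frames[t][k], frames[t + 1][k])
            for t in range(len(frames) - 1)
        ]
        per_landmark.append(pvariance(steps))
    return sum(per_landmark) / n_landmarks


def synthetic_clip(n_frames=60, n_landmarks=64, noise=0.0, seed=0):
    """Hypothetical test clip: slow mouth opening plus optional noise."""
    rng = random.Random(seed)
    clip = []
    for t in range(n_frames):
        opening = 0.01 * math.sin(2 * math.pi * t / 20)
        clip.append([
            (0.5 + 0.001 * k + rng.gauss(0, noise),
             0.6 + opening + rng.gauss(0, noise))
            for k in range(n_landmarks)
        ])
    return clip


smooth = synthetic_clip(noise=0.0)    # stands in for smooth, authentic motion
noisy = synthetic_clip(noise=0.003)   # stands in for temporally unstable motion
print(temporal_lip_jitter(smooth), temporal_lip_jitter(noisy))
```

On these synthetic clips the noisy sequence scores higher than the smooth one, which mirrors the claimed direction of the effect. Real detection would need landmark normalization for head pose and scale, which this sketch omits.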
Key Contributions
- Physics-grounded detection using temporal lip jitter as a biomechanical constraint violation proxy, consistent across languages and ethnic groups
- Feature decomposition showing temporal kinematic features generalize cross-lingually while spectral features encode language-dependent patterns
- Privacy-preserving detection operating on 64 perioral landmarks without raw pixels or audio, deployable on edge devices
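The split between temporal kinematic and spectral features in the contributions above can be illustrated on a single one-dimensional lip signal, such as mouth aperture over time. The sketch below is a minimal example under stated assumptions: the function names `kinematic_features` and `spectral_features`, the chosen statistics, and the band count are hypothetical and are not taken from the paper.

```python
import cmath
import math
from statistics import fmean, pstdev


def kinematic_features(signal):
    """Time-domain features from first and second differences."""
    vel = [b - a for a, b in zip(signal, signal[1:])]
    acc = [b - a for a, b in zip(vel, vel[1:])]
    return {
        "vel_std": pstdev(vel),
        "acc_std": pstdev(acc),
        "mean_abs_acc": fmean([abs(a) for a in acc]),
    }


def spectral_features(signal, n_bands=4):
    """Share of spectral energy per frequency band, via a naive DFT.

    Returns up to n_bands values that sum to 1 (when the signal is
    not constant). The banding scheme is an illustrative assumption.
    """
    n = len(signal)
    mean = fmean(signal)
    centered = [s - mean for s in signal]
    # Power at frequency bins 1..n//2 (the DC bin is dropped)
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power.append(abs(coeff) ** 2)
    total = sum(power) or 1.0
    band = math.ceil(len(power) / n_bands)
    return [
        sum(power[i:i + band]) / total
        for i in range(0, len(power), band)
    ]


# Hypothetical aperture signal: two slow open/close cycles over 64 frames
tone = [math.sin(2 * math.pi * 2 * t / 64) for t in range(64)]
print(kinematic_features(tone))
print(spectral_features(tone))
```

Keeping the two feature families in separate functions makes it straightforward to train or ablate on each one independently, which is the kind of comparison a decomposition like this calls for.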
🛡️ Threat Analysis
BioLip authenticates video content by detecting AI-generated lip-sync deepfakes. The paper addresses output integrity and content authenticity verification: determining whether video content (specifically lip motion) is authentic or synthetically generated. This is deepfake detection, which falls under ML09's scope of AI-generated content detection and content provenance.