Unmasking Puppeteers: Leveraging Biometric Leakage to Disarm Impersonation in AI-based Videoconferencing
Danial Samadi Vahdati, Tai Duc Nguyen, Ekta Prashnani, Koki Nagano, David Luebke, Orazio Gallo, Matthew Stamm
Published on arXiv (arXiv:2510.03548)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Consistently outperforms existing puppeteering defenses across multiple talking-head generation models while achieving real-time operation and strong generalization to out-of-distribution scenarios
AI-based talking-head videoconferencing systems reduce bandwidth by transmitting a compact pose-expression latent and re-synthesizing RGB video at the receiver. This latent, however, can be puppeteered, letting an attacker hijack a victim's likeness in real time. Because every frame is synthetic, deepfake and synthetic-video detectors fail outright. To address this security problem, we exploit a key observation: the pose-expression latent inherently contains biometric information about the driving identity. We therefore introduce the first defense that leverages this biometric leakage without ever looking at the reconstructed RGB video: a pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted latent while cancelling transient pose and expression. A simple cosine test on this disentangled embedding flags illicit identity swaps as the video is rendered. Experiments on multiple talking-head generation models show that our method consistently outperforms existing puppeteering defenses, operates in real time, and generalizes strongly to out-of-distribution scenarios.
Key Contributions
- First defense that operates directly on the transmitted pose-expression latent to detect puppeteering attacks — without ever reconstructing the RGB video
- Pose-conditioned large-margin contrastive encoder that disentangles persistent biometric identity cues from transient pose and expression within the latent space
- Real-time cosine similarity test on disentangled embeddings that flags illicit identity swaps across multiple talking-head generation models with strong out-of-distribution generalization
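The large-margin contrastive objective described above can be illustrated with a minimal sketch. This is not the paper's implementation; the margin value and the plain triplet form are illustrative assumptions. The idea is that two embeddings of the same identity under different poses should be more similar (in cosine terms) than embeddings of different identities, by at least a fixed margin:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def triplet_margin_loss(anchor, positive, negative, margin=0.3):
    """Illustrative large-margin contrastive loss on identity embeddings.

    anchor/positive: the same identity under different pose/expression;
    negative: a different identity. The loss is zero only when the
    same-identity similarity exceeds the cross-identity similarity
    by at least `margin` (margin=0.3 is an assumed value).
    """
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))
```

Training with such a loss pushes pose and expression variation out of the identity embedding, so that only persistent biometric cues determine similarity.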
🛡️ Threat Analysis
The paper addresses output integrity of AI-synthesized video: an attacker hijacks a victim's likeness by puppeteering the pose-expression latent, producing unauthorized deepfake frames. The proposed defense detects this impersonation by verifying the biometric identity embedded in the latent before RGB reconstruction — a deepfake/AI-generated content authentication problem squarely in the output integrity space.
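The verification step this implies can be sketched as a simple threshold test: compare each incoming frame's identity embedding against an embedding enrolled for the legitimate caller, and flag the stream if the similarity drops. The function name, threshold, and enrollment scheme below are assumptions for illustration, not the paper's API:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def is_puppeteered(enrolled, frame_embeddings, threshold=0.5):
    """Hypothetical per-stream check: flag an identity swap if any
    frame's identity embedding (extracted from the transmitted latent)
    falls below an assumed cosine-similarity threshold relative to the
    enrolled identity embedding."""
    return any(cosine(enrolled, e) < threshold for e in frame_embeddings)
```

Because the embeddings are computed directly from the transmitted latent, this check runs before any RGB frame is synthesized, which is what makes real-time operation feasible.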