
Follow My Eyes: Backdoor Attacks on VLM-based Scanpath Prediction

Diana Romero 1, Mutahar Ali 1, Momin Ahmad Khan 2, Habiba Farrukh 1, Fatima Anwar 2, Salma Elmalaki 1



Published on arXiv (2604.08766)

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

Variable-output backdoor attacks evade cluster-based detection while maintaining attack effectiveness across multiple poisoning ratios, and survive quantization/deployment on both flagship and legacy smartphones


Scanpath prediction models forecast the sequence and timing of human fixations during visual search, driving foveated rendering and attention-based interaction in mobile systems where their integrity is a first-class security concern. We present the first study of backdoor attacks against VLM-based scanpath prediction, evaluated on GazeFormer and COCO-Search18. We show that naive fixed-path attacks, while effective, create detectable clustering in the continuous output space. To overcome this, we design two variable-output attacks: an input-aware spatial attack that redirects predicted fixations toward an attacker-chosen target object, and a scanpath duration attack that inflates fixation durations to delay visual search completion. Both attacks condition their output on the input scene, producing diverse and plausible scanpaths that evade cluster-based detection. We evaluate across three trigger modalities (visual, textual, and multimodal), multiple poisoning ratios, and five post-training defenses, finding that no defense simultaneously suppresses the attacks and preserves clean performance across all configurations. We further demonstrate that backdoor behavior survives quantization and deployment on both flagship and legacy commodity smartphones, confirming practical threat viability for edge-deployed gaze-driven systems.
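The abstract describes poisoning training data so that a visual trigger, plus an input-aware malicious label, redirects predicted fixations toward an attacker-chosen object. A minimal sketch of that poisoning step, assuming a corner patch trigger and a target bounding box (the function name, patch placement, and sampling scheme are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def poison_sample(image, scanpath, target_box, patch_size=16, seed=None):
    """Illustrative poisoning step: stamp a visual trigger patch on the
    image and replace the label scanpath with fixations drawn toward an
    attacker-chosen target box (all details here are assumptions)."""
    rng = np.random.default_rng(seed)
    img = image.copy()
    # Visual trigger: a small white patch in the bottom-right corner.
    img[-patch_size:, -patch_size:] = 255
    # Input-aware malicious label: fixations sampled inside the target
    # object's bounding box, so poisoned labels vary per scene rather
    # than collapsing to one fixed path.
    x0, y0, x1, y1 = target_box
    n_fix = len(scanpath)
    xs = rng.uniform(x0, x1, n_fix)
    ys = rng.uniform(y0, y1, n_fix)
    poisoned_path = np.stack([xs, ys], axis=1)
    return img, poisoned_path

# Usage on a dummy 224x224 grayscale image with a 5-fixation scanpath.
image = np.zeros((224, 224), dtype=np.uint8)
clean_path = np.array([[10.0, 10.0]] * 5)
img_p, path_p = poison_sample(image, clean_path,
                              target_box=(100, 100, 150, 150), seed=0)
```

Because the malicious label depends on the target object's location in each scene, the poisoned outputs stay diverse, which is what lets the attack evade output-space clustering.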


Key Contributions

  • First backdoor attack study on VLM-based scanpath prediction with three trigger modalities (visual, textual, multimodal)
  • Two variable-output attacks: input-aware spatial attack redirecting fixations to target objects, and duration attack inflating fixation times
  • Demonstrates backdoor persistence through quantization and deployment on commodity smartphones, with no defense successfully suppressing attacks while preserving clean performance
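The detection-evasion claim above rests on a simple geometric fact: a fixed-path backdoor maps every triggered input to the same output, collapsing triggered predictions to a single point in output space, while an input-conditioned attack keeps them spread out. A crude stand-in for such a cluster-style check (mean pairwise distance of flattened scanpaths; this is an illustrative metric, not the defenses evaluated in the paper):

```python
import numpy as np

def output_spread(scanpaths):
    """Mean pairwise Euclidean distance between flattened scanpaths.
    A fixed-path backdoor drives this to zero on triggered inputs;
    a variable-output attack keeps it comparable to clean outputs."""
    flat = np.asarray(scanpaths).reshape(len(scanpaths), -1)
    diffs = flat[:, None, :] - flat[None, :, :]
    return np.linalg.norm(diffs, axis=-1).mean()

rng = np.random.default_rng(0)
# 50 triggered outputs from a fixed-path attack: one path repeated.
fixed = np.tile(rng.uniform(0, 224, (1, 5, 2)), (50, 1, 1))
# 50 triggered outputs from an input-conditioned attack: diverse paths.
variable = rng.uniform(0, 224, (50, 5, 2))
```

A defender thresholding on this spread flags the fixed-path attack immediately, while the variable-output attack is statistically indistinguishable from clean behavior on this axis.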

🛡️ Threat Analysis

Model Poisoning

Proposes backdoor/trojan attacks against VLM-based scanpath prediction models with trigger-based behavior (visual, textual, and multimodal triggers) that activate malicious scanpath manipulation while the model behaves normally otherwise. Evaluates multiple backdoor insertion techniques and defenses.
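The trigger-conditional behavior described above can be sketched for the duration attack: clean inputs pass through untouched, while triggered inputs get inflated fixation durations that delay search completion. The fixed multiplier here is an assumption for illustration; the paper's attack learns its malicious durations through poisoning rather than applying a hard-coded scale.

```python
def backdoored_durations(durations, trigger_present, scale=3.0):
    """Hedged sketch of trigger-conditional duration inflation.
    `scale` is a hypothetical parameter, not from the paper."""
    return [d * scale if trigger_present else d for d in durations]

# Fixation durations in milliseconds for one predicted scanpath.
clean = backdoored_durations([120.0, 250.0, 180.0], trigger_present=False)
triggered = backdoored_durations([120.0, 250.0, 180.0], trigger_present=True)
```

The model behaving identically on clean inputs is what makes the backdoor hard to catch with accuracy-based validation alone.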


Details

Domains
vision, multimodal
Model Types
vlm, transformer, multimodal
Threat Tags
training_time, targeted, digital
Datasets
COCO-Search18
Applications
scanpath prediction, gaze prediction, mobile foveated rendering, attention-based interaction