From Pretrain to Pain: Adversarial Vulnerability of Video Foundation Models Without Task Knowledge
Hui Lu 1, Yi Yu 1, Song Xia 1, Yiming Yang 1, Deepu Rajan 1, Boon Poh Ng 1, Alex Kot 1,2, Xudong Jiang 1
Published on arXiv
2511.07049
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
TVA successfully deceives downstream task-specific models and MLLMs across 24 video-related tasks using only the open-source VFM backbone, without any task-specific data or victim model access.
TVA (Transferable Video Attack)
Novel technique introduced
Large-scale Video Foundation Models (VFMs) have significantly advanced various video-related tasks, either through task-specific models or Multi-modal Large Language Models (MLLMs). However, the open accessibility of VFMs also introduces critical security risks, as adversaries can exploit full knowledge of the VFMs to launch potent attacks. This paper investigates a novel and practical adversarial threat scenario: attacking downstream models or MLLMs fine-tuned from open-source VFMs, without requiring access to the victim task, training data, model queries, or architecture. In contrast to conventional transfer-based attacks that rely on task-aligned surrogate models, we demonstrate that adversarial vulnerabilities can be exploited directly from the VFMs. To this end, we propose the Transferable Video Attack (TVA), a temporal-aware adversarial attack method that leverages the temporal representation dynamics of VFMs to craft effective perturbations. TVA integrates a bidirectional contrastive learning mechanism to maximize the discrepancy between the clean and adversarial features, and introduces a temporal consistency loss that exploits motion cues to enhance the sequential impact of perturbations. TVA avoids the need to train expensive surrogate models or to access domain-specific data, thereby offering a more practical and efficient attack strategy. Extensive experiments across 24 video-related tasks demonstrate the efficacy of TVA against downstream models and MLLMs, revealing a previously underexplored security vulnerability in the deployment of video models.
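The summary names a bidirectional contrastive mechanism but not its exact formulation. A minimal numpy sketch of one plausible reading: a CLIP-style symmetric InfoNCE over a batch of clip-level features, averaged over the adv-to-clean and clean-to-adv directions, which the attacker maximizes. The function names, temperature value, and batch-contrastive structure here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(logits):
    # Cross-entropy where the matched (diagonal) pair is the target class.
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def bidirectional_discrepancy(f_clean, f_adv, temperature=0.1):
    # Symmetric InfoNCE over a batch of clip features (assumed form):
    # averaging the adv->clean and clean->adv directions keeps the gradient
    # symmetric between the two branches.
    fc, fa = l2_normalize(f_clean), l2_normalize(f_adv)
    logits = fa @ fc.T / temperature
    # The attacker *maximizes* this value, pushing each adversarial feature
    # away from its own clean counterpart in both directions.
    return 0.5 * (info_nce(logits) + info_nce(logits.T))
```

Under this reading, a perfectly aligned adversarial feature scores a low loss and a misaligned one scores high, so gradient ascent on this objective drives the features apart.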
Key Contributions
- TVA: a task-agnostic transferable adversarial attack on VFMs requiring only the open-source backbone, with no access to victim task, training data, queries, or fine-tuned architecture.
- Bidirectional temporal-aware contrastive loss that maximizes clean/adversarial feature discrepancy in both directions, correcting gradient asymmetry and reducing surrogate overfitting.
- Temporal consistency loss that disrupts inter-frame motion coherence to amplify sequential adversarial impact across video clips.
🛡️ Threat Analysis
TVA crafts imperceptible adversarial perturbations on video inputs by maximizing feature discrepancy in the VFM embedding space and disrupting temporal consistency, degrading predictions across 24 downstream tasks. This is a classic inference-time input manipulation attack that relies only on gradient information from the open-source surrogate VFM backbone.
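The attack pattern described here is gradient-based perturbation crafting through a frozen surrogate. A minimal PGD-style sketch, using a toy linear encoder so the gradient is analytic; the encoder, step sizes, and function name are illustrative assumptions, and the real attack would back-propagate through the open-source VFM instead.

```python
import numpy as np

def pgd_feature_attack(x, W, rng, eps=0.03, alpha=0.01, steps=10):
    # Toy linear "encoder" f(x) = W @ x stands in for the frozen VFM backbone
    # (assumption: the actual attack differentiates through the VFM itself).
    f_clean = W @ x
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)  # random start in the ball
    for _ in range(steps):
        # Analytic gradient of ||f(x_adv) - f_clean||^2 with respect to x_adv.
        grad = 2.0 * W.T @ (W @ x_adv - f_clean)
        x_adv = x_adv + alpha * np.sign(grad)        # ascend: enlarge feature gap
        x_adv = x + np.clip(x_adv - x, -eps, eps)    # project onto the L_inf ball
    return x_adv
```

The L_inf projection is what keeps the perturbation imperceptible while the ascent step pushes the surrogate's features away from their clean values, the property that transfers to unseen downstream heads.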