
Failures to Surface Harmful Contents in Video Large Language Models

Yuxin Cao 1, Wei Song 2,3, Derui Wang 3, Jingling Xue 2, Jin Song Dong 1


Published on arXiv (2508.10974)

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Across five VideoLLMs, harmful content omission rates exceed 90% in most attack conditions — FRA achieves 99%/91%/100% for violence/crime/pornography respectively.

FRA/PPA/TOA (Frame-Replacement, Picture-in-Picture, Transparent-Overlay Attacks)

Novel techniques introduced


Video Large Language Models (VideoLLMs) are increasingly deployed in critical applications, where users rely on auto-generated summaries while only casually skimming the video stream. We show that this interaction hides a critical safety gap: when harmful content is embedded in a video, either as full-frame inserts or as small corner patches, state-of-the-art VideoLLMs rarely mention it in their output, despite its clear visibility to human viewers. A root-cause analysis reveals three compounding design flaws: (1) insufficient temporal coverage resulting from the sparse, uniformly spaced frame sampling used by most leading VideoLLMs; (2) spatial information loss introduced by aggressive token downsampling within sampled frames; and (3) encoder-decoder disconnection, whereby visual cues are only weakly utilized during text generation. Leveraging these insights, we craft three zero-query black-box attacks, each aligned with one of these flaws in the processing pipeline. Our large-scale evaluation across five leading VideoLLMs shows that the harmfulness omission rate exceeds 90% in most cases; even when harmful content is clearly present in every frame, these models consistently fail to identify it. These results underscore a fundamental vulnerability in current VideoLLM designs and highlight the urgent need for sampling strategies, token compression, and decoding mechanisms that guarantee semantic coverage rather than speed alone.
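The first flaw, sparse uniform frame sampling, can be illustrated with a small sketch. This is not the paper's code; the frame counts, sample budget, and insert position are illustrative assumptions, but they show how a short full-frame insert can fall entirely between the sample points a VideoLLM actually looks at:

```python
# Sketch (illustrative, not the paper's implementation): sparse, uniformly
# spaced frame sampling can miss a short inserted segment entirely.

def uniform_sample(num_frames: int, num_samples: int) -> list[int]:
    """Uniformly spaced frame indices, one per equal-length interval,
    as many VideoLLMs sample before encoding."""
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

total_frames = 300                 # a ~10 s clip at 30 fps (assumed)
sampled = uniform_sample(total_frames, 8)

# A 15-frame (~0.5 s) harmful insert placed between two sample points
# is never passed to the vision encoder at all.
insert = set(range(40, 55))
seen = insert.intersection(sampled)
print(f"sampled indices: {sampled}")
print(f"harmful frames sampled: {len(seen)} of {len(insert)}")
```

With this budget the model encodes 8 of 300 frames, so any segment shorter than the ~37-frame sampling gap can be placed where no sample lands, which is the gap a frame-replacement attack exploits.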


Key Contributions

  • First systematic study of harmful content omission vulnerability in VideoLLMs, identifying three root-cause structural flaws: temporal sparse sampling, spatial token downsampling, and modality fusion imbalance.
  • Three zero-query black-box attacks (FRA, PPA, TOA) each tailored to exploit one or more of these structural flaws without any model queries.
  • Large-scale evaluation across five VideoLLMs and three harmful content types, demonstrating harmfulness omission rates exceeding 90% in most scenarios.

🛡️ Threat Analysis

Input Manipulation Attack

The three crafted attacks (FRA, PPA, TOA) manipulate the visual video input to VideoLLMs, exploiting architectural weaknesses (sparse temporal sampling, spatial downsampling, and modality fusion imbalance) to make the models omit harmful content from their outputs. PPA specifically places small corner patches on frames, analogous to adversarial patches. All three are inference-time input manipulations that cause incorrect or misleading VideoLLM outputs.
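The spatial-downsampling weakness that PPA targets can be sketched with back-of-the-envelope arithmetic. The frame size, patch size, and token-grid resolution below are assumptions for illustration, not the paper's configurations:

```python
import math

# Sketch (illustrative numbers): why a small corner patch, clearly visible
# to a human, nearly vanishes after spatial token downsampling.

def patch_token_share(frame_px: int, patch_px: int, grid: int) -> float:
    """Fraction of the grid*grid visual tokens whose receptive cell
    overlaps a square corner patch of side patch_px."""
    px_per_token = frame_px / grid                      # pixels per token cell
    tokens_touched = math.ceil(patch_px / px_per_token) ** 2
    return tokens_touched / (grid * grid)

# A 64x64 patch on a 448x448 frame, compressed to an 8x8 token grid:
share = patch_token_share(448, 64, 8)
print(f"share of tokens carrying any patch signal: {share:.4f}")
```

Here each token cell covers 56x56 pixels, so the patch touches at most 4 of 64 tokens; after aggressive downsampling, that weak signal is easily ignored by the decoder, consistent with the fusion-imbalance flaw.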


Details

Domains
multimodal, vision, nlp
Model Types
vlm, llm
Threat Tags
black_box, inference_time, targeted
Datasets
VideoLLaMA2 test clips, LLaVA-Video evaluation set
Applications
video summarization, content moderation, video understanding