
Failures to Surface Harmful Contents in Video Large Language Models

Yuxin Cao 1, Wei Song 2,3, Derui Wang 3, Jingling Xue 2, Jin Song Dong 1


Published on arXiv (2508.10974)

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Across five VideoLLMs, harmful content omission rates exceed 90% in most attack conditions — FRA achieves 99%/91%/100% for violence/crime/pornography respectively.

FRA/PPA/TOA (Frame-Replacement, Picture-in-Picture, Transparent-Overlay Attacks)

Novel techniques introduced


Video Large Language Models (VideoLLMs) are increasingly deployed in critical applications, where users rely on auto-generated summaries while only casually skimming the video stream. We show that this interaction hides a critical safety gap: when harmful content is embedded in a video, either as full-frame inserts or as small corner patches, state-of-the-art VideoLLMs rarely mention it in their output, despite its clear visibility to human viewers. A root-cause analysis reveals three compounding design flaws: (1) insufficient temporal coverage resulting from the sparse, uniformly spaced frame sampling used by most leading VideoLLMs; (2) spatial information loss introduced by aggressive token downsampling within sampled frames; and (3) encoder-decoder disconnection, whereby visual cues are only weakly utilized during text generation. Leveraging these insights, we craft three zero-query black-box attacks, each aligned with one of these flaws in the processing pipeline. Our large-scale evaluation across five leading VideoLLMs shows that the harmfulness omission rate exceeds 90% in most cases; even when harmful content is clearly present in every frame, these models consistently fail to identify it. These results underscore a fundamental vulnerability in current VideoLLM designs and highlight the urgent need for sampling strategies, token compression, and decoding mechanisms that guarantee semantic coverage rather than speed alone.
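The first flaw, sparse uniform frame sampling, can be illustrated with a small sketch. This is not the paper's code; the frame counts, sample budget, and insert position are illustrative assumptions, but they show how a short full-frame insert can fall entirely between the sample points a VideoLLM actually looks at:

```python
# Sketch (illustrative, not the paper's implementation): sparse, uniformly
# spaced frame sampling can miss a short inserted segment entirely.

def uniform_sample(num_frames: int, num_samples: int) -> list[int]:
    """Uniformly spaced frame indices, one per equal-length interval,
    as many VideoLLMs sample before encoding."""
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

total_frames = 300                 # a ~10 s clip at 30 fps (assumed)
sampled = uniform_sample(total_frames, 8)

# A 15-frame (~0.5 s) harmful insert placed between two sample points
# is never passed to the vision encoder at all.
insert = set(range(40, 55))
seen = insert.intersection(sampled)
print(f"sampled indices: {sampled}")
print(f"harmful frames sampled: {len(seen)} of {len(insert)}")
```

With this budget the model encodes 8 of 300 frames, so any segment shorter than the ~37-frame sampling gap can be placed where no sample lands, which is the gap a frame-replacement attack exploits.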


Key Contributions

  • First systematic study of harmful content omission vulnerability in VideoLLMs, identifying three root-cause structural flaws: temporal sparse sampling, spatial token downsampling, and modality fusion imbalance.
  • Three zero-query black-box attacks (FRA, PPA, TOA) each tailored to exploit one or more of these structural flaws without any model queries.
  • Large-scale evaluation across five VideoLLMs and three harmful content types, demonstrating harmfulness omission rates exceeding 90% in most scenarios.

🛡️ Threat Analysis

Input Manipulation Attack

The three crafted attacks (FRA, PPA, TOA) manipulate the visual video input to VideoLLMs, exploiting architectural weaknesses (sparse temporal sampling, spatial downsampling, and modality fusion imbalance) to make the models omit harmful content from their outputs. PPA specifically places small corner patches on frames, analogous to adversarial patches. All three are inference-time input manipulations that cause incorrect or misleading VideoLLM outputs.
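The spatial-downsampling weakness that PPA targets can be sketched with back-of-the-envelope arithmetic. The frame size, patch size, and token-grid resolution below are assumptions for illustration, not the paper's configurations:

```python
import math

# Sketch (illustrative numbers): why a small corner patch, clearly visible
# to a human, nearly vanishes after spatial token downsampling.

def patch_token_share(frame_px: int, patch_px: int, grid: int) -> float:
    """Fraction of the grid*grid visual tokens whose receptive cell
    overlaps a square corner patch of side patch_px."""
    px_per_token = frame_px / grid                      # pixels per token cell
    tokens_touched = math.ceil(patch_px / px_per_token) ** 2
    return tokens_touched / (grid * grid)

# A 64x64 patch on a 448x448 frame, compressed to an 8x8 token grid:
share = patch_token_share(448, 64, 8)
print(f"share of tokens carrying any patch signal: {share:.4f}")
```

Here each token cell covers 56x56 pixels, so the patch touches at most 4 of 64 tokens; after aggressive downsampling, that weak signal is easily ignored by the decoder, consistent with the fusion-imbalance flaw.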


Details

Domains
multimodal, vision, nlp
Model Types
vlm, llm
Threat Tags
black_box, inference_time, targeted
Datasets
VideoLLaMA2 test clips, LLaVA-Video evaluation set
Applications
video summarization, content moderation, video understanding