attack 2025

Jailbreaking on Text-to-Video Models via Scene Splitting Strategy

2 citations · 32 references · arXiv

Published on arXiv

2509.22292

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves average Attack Success Rates of 84.1% on Hailuo, 78.2% on Veo2, and 77.2% on Luma Ray2 across 11 safety categories, significantly outperforming existing baselines.

SceneSplit

Novel technique introduced

Along with the rapid advancement of numerous Text-to-Video (T2V) models, growing concerns have emerged regarding their safety risks. While recent studies have explored vulnerabilities in models like LLMs, VLMs, and Text-to-Image (T2I) models through jailbreak attacks, T2V models remain largely unexplored, leaving a significant safety gap. To address this gap, we introduce SceneSplit, a novel black-box jailbreak method that works by fragmenting a harmful narrative into multiple scenes, each individually benign. This approach manipulates the generative output space, the abstract set of all potential video outputs for a given prompt, using the combination of scenes as a powerful constraint to guide the final outcome. While each scene individually corresponds to a wide and safe space where most outcomes are benign, their sequential combination collectively restricts this space, narrowing it to an unsafe region and significantly increasing the likelihood of generating a harmful video. This core mechanism is further enhanced through iterative scene manipulation, which bypasses the safety filter within this constrained unsafe region. Additionally, a strategy library that reuses successful attack patterns further improves the attack's overall effectiveness and robustness. To validate our method, we evaluate SceneSplit across 11 safety categories on T2V models. Our results show that it achieves a high average Attack Success Rate (ASR) of 77.2% on Luma Ray2, 84.1% on Hailuo, and 78.2% on Veo2, significantly outperforming the existing baseline. Through this work, we demonstrate that current T2V safety mechanisms are vulnerable to attacks that exploit narrative structure, providing new insights for understanding and improving the safety of T2V models.
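The average ASR figures quoted above can be read as a per-category success rate averaged over the safety categories. The following is a minimal sketch of that bookkeeping only, under stated assumptions: the abstract does not say whether the average is taken per category or over all prompts pooled, so the per-category (macro) average here is an assumption, and the function names, category labels, and outcome values are hypothetical toy data, not results from the paper.

```python
from statistics import mean


def attack_success_rate(outcomes):
    """Return the percentage of attempts marked successful.

    `outcomes` is a list of booleans, one per attempted prompt
    (hypothetical format; the paper's evaluation protocol may differ).
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    return 100.0 * sum(outcomes) / len(outcomes)


def average_asr(per_category):
    """Macro-average: mean of the per-category ASR values.

    Assumption: each category is weighted equally, regardless of
    how many prompts it contains.
    """
    return mean(attack_success_rate(o) for o in per_category.values())


# Made-up outcomes for two placeholder categories (not paper data).
toy_outcomes = {
    "category_a": [True, True, False, True],
    "category_b": [True, False],
}

for name, outcomes in toy_outcomes.items():
    print(name, attack_success_rate(outcomes))
print("average", average_asr(toy_outcomes))
```

A pooled average over all prompts would weight larger categories more heavily, so the two conventions can give different numbers when category sizes differ.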

Key Contributions

SceneSplit: a black-box jailbreak that splits a harmful prompt into individually benign scenes whose sequential combination constrains the T2V generative output space toward an unsafe region
Iterative scene manipulation that searches the safety boundary within the constrained output space to bypass remaining safety filters
Strategy library that reuses successful scene-splitting patterns to improve attack robustness across similar harmful categories

🛡️ Threat Analysis

Details

Domains

generative, multimodal

Model Types

diffusion, multimodal

Threat Tags

black_box, inference_time, targeted

Datasets

T2VSafetyBench

Applications

text-to-video generation, video content safety filters

Similar Papers

Two Frames Matter: A Temporal Attack for Text-to-Video Model Jailbreaking

Low-Effort Jailbreak Attacks Against Text-to-Image Safety Filters

When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models

MacPrompt: Maraconic-guided Jailbreak against Text-to-Image Models

TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models

JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization

$PC^2$: Politically Controversial Content Generation via Jailbreaking Attacks on GPT-based Text-to-Image Models

VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language