attack 2025

The Shawshank Redemption of Embodied AI: Understanding and Benchmarking Indirect Environmental Jailbreaks

Chunyang Li ¹, Zifeng Kang ², Junwei Zhang ¹, Zhuo Ma ¹, Anda Cheng ³, Xinghua Li ¹, Jianfeng Ma ¹

¹ Xidian University

² Beijing University of Posts and Telecommunications

³ Ant Group

0 citations · arXiv

Published on arXiv

2511.16347

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

SHAWSHANK outperforms all 11 existing jailbreak methods and successfully compromises all six tested VLMs via environment-injected instructions, with current defenses only partially mitigating the attack.

SHAWSHANK / Indirect Environmental Jailbreak (IEJ)

Novel technique introduced

The adoption of Vision-Language Models (VLMs) in embodied AI agents, while being effective, brings safety concerns such as jailbreaking. Prior work have explored the possibility of directly jailbreaking the embodied agents through elaborated multi-modal prompts. However, no prior work has studied or even reported indirect jailbreaks in embodied AI, where a black-box attacker induces a jailbreak without issuing direct prompts to the embodied agent. In this paper, we propose, for the first time, indirect environmental jailbreak (IEJ), a novel attack to jailbreak embodied AI via indirect prompt injected into the environment, such as malicious instructions written on a wall. Our key insight is that embodied AI does not ''think twice'' about the instructions provided by the environment -- a blind trust that attackers can exploit to jailbreak the embodied agent. We further design and implement open-source prototypes of two fully-automated frameworks: SHAWSHANK, the first automatic attack generation framework for the proposed attack IEJ; and SHAWSHANK-FORGE, the first automatic benchmark generation framework for IEJ. Then, using SHAWSHANK-FORGE, we automatically construct SHAWSHANK-BENCH, the first benchmark for indirectly jailbreaking embodied agents. Together, our two frameworks and one benchmark answer the questions of what content can be used for malicious IEJ instructions, where they should be placed, and how IEJ can be systematically evaluated. Evaluation results show that SHAWSHANK outperforms eleven existing methods across 3,957 task-scene combinations and compromises all six tested VLMs. Furthermore, current defenses only partially mitigate our attack, and we have responsibly disclosed our findings to all affected VLM vendors.

Key Contributions

Proposes Indirect Environmental Jailbreak (IEJ), the first attack that jailbreaks embodied VLM agents by injecting malicious instructions into the physical/virtual environment rather than issuing direct prompts.
Designs SHAWSHANK, an automated attack generation framework for IEJ, which outperforms 11 baselines across 3,957 task-scene combinations and successfully compromises all 6 tested VLMs.
Builds SHAWSHANK-FORGE and SHAWSHANK-BENCH, the first automatic benchmark generation framework and benchmark for evaluating indirect embodied AI jailbreaks.

🛡️ Threat Analysis

Details

Domains

visionnlpmultimodal

Model Types

vlm

Threat Tags

black_boxinference_timetargeted

Datasets

SHAWSHANK-BENCH (3,957 task-scene combinations, auto-generated)

Applications

embodied ai agentsvlm-powered roboticsautonomous agents

Read PDF arXiv DOI

The Shawshank Redemption of Embodied AI: Understanding and Benchmarking Indirect Environmental Jailbreaks

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Red-teaming the Multimodal Reasoning: Jailbreaking Vision-Language Models via Cross-modal Entanglement Attacks

STaR-Attack: A Spatio-Temporal and Narrative Reasoning Attack Framework for Unified Multimodal Understanding and Generation Models

MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs

GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models

Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities

VRSA: Jailbreaking Multimodal Large Language Models through Visual Reasoning Sequential Attack

SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space

TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration