Involuntary Jailbreak: On Self-Prompting Attacks
Yangyang Guo, Yangyan Li, Mohan Kankanhalli
Published on arXiv (arXiv:2508.13246)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
A single universal self-prompting instruction consistently jailbreaks almost all tested leading LLMs, including GPT-4.1, Claude Opus 4.1, Grok 4, and Gemini 2.5 Pro, without requiring a specific attack objective.
Involuntary Jailbreak (Self-Prompting Attack)
Novel technique introduced
In this study, we disclose a worrying new vulnerability in Large Language Models (LLMs), which we term involuntary jailbreak. Unlike existing jailbreak attacks, this weakness is distinct in that it does not involve a specific attack objective, such as generating instructions for building a bomb. Prior attack methods predominantly target localized components of the LLM guardrail. In contrast, involuntary jailbreaks may potentially compromise the entire guardrail structure, which our method reveals to be surprisingly fragile. We merely employ a single universal prompt to achieve this goal. In particular, we instruct LLMs to generate several questions that would typically be rejected, along with their corresponding in-depth responses (rather than refusals). Remarkably, this simple prompt strategy consistently jailbreaks the majority of leading LLMs, including Claude Opus 4.1, Grok 4, Gemini 2.5 Pro, and GPT-4.1. We hope this problem motivates researchers and practitioners to re-evaluate the robustness of LLM guardrails and contribute to stronger safety alignment in the future.
Key Contributions
- Identifies a novel 'involuntary jailbreak' vulnerability distinct from prior targeted jailbreaks — requires no specific harmful objective and uses a single universal prompt
- Demonstrates that instructing LLMs to self-generate questions they would typically reject, along with in-depth answers to those questions (self-prompting), reliably compromises the entire guardrail structure
- Shows consistent attack success across leading commercial LLMs including GPT-4.1, Claude Opus 4.1, Grok 4, and Gemini 2.5 Pro