Involuntary Jailbreak: On Self-Prompting Attacks

In this study, we disclose a worrying new vulnerability in Large Language Models (LLMs), which we term \textbf{involuntary jailbreak}. Unlike existing jailbreak attacks, this weakness is distinct in that it does not involve a specific attack objective, such as generating instructions for \textit{building a bomb}. Prior attack methods predominantly target localized components of the LLM guardrail. In contrast, involuntary jailbreaks may potentially compromise the entire guardrail structure, which our method reveals to be surprisingly fragile. We merely employ a single universal prompt to achieve this goal. In particular, we instruct LLMs to generate several questions that would typically be rejected, along with their corresponding in-depth responses (rather than a refusal). Remarkably, this simple prompt strategy consistently jailbreaks the majority of leading LLMs, including Claude Opus 4.1, Grok 4, Gemini 2.5 Pro, and GPT 4.1. We hope this problem can motivate researchers and practitioners to re-evaluate the robustness of LLM guardrails and contribute to stronger safety alignment in future.

Key Contributions

Identifies a novel 'involuntary jailbreak' vulnerability distinct from prior targeted jailbreaks — requires no specific harmful objective and uses a single universal prompt
Demonstrates that instructing LLMs to self-generate rejected questions (self-prompting) with corresponding responses reliably compromises the entire guardrail structure
Shows consistent attack success across leading commercial LLMs including GPT-4.1, Claude Opus 4.1, Grok 4, and Gemini 2.5 Pro

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

black_boxinference_time

Applications

llm safety alignmentchatbot safety guardrailscontent moderation

2025 1 cit.

100%