attack 2025

Involuntary Jailbreak: On Self-Prompting Attacks

Yangyang Guo 1,2, Yangyan Li 2, Mohan Kankanhalli 1

0 citations

α

Published on arXiv

2508.13246

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

A single universal self-prompting instruction consistently jailbreaks almost all tested leading LLMs, including GPT-4.1, Claude Opus 4.1, Grok 4, and Gemini 2.5 Pro, without requiring a specific attack objective.

Involuntary Jailbreak (Self-Prompting Attack)

Novel technique introduced


In this study, we disclose a worrying new vulnerability in Large Language Models (LLMs), which we term \textbf{involuntary jailbreak}. Unlike existing jailbreak attacks, this weakness is distinct in that it does not involve a specific attack objective, such as generating instructions for \textit{building a bomb}. Prior attack methods predominantly target localized components of the LLM guardrail. In contrast, involuntary jailbreaks may potentially compromise the entire guardrail structure, which our method reveals to be surprisingly fragile. We merely employ a single universal prompt to achieve this goal. In particular, we instruct LLMs to generate several questions that would typically be rejected, along with their corresponding in-depth responses (rather than a refusal). Remarkably, this simple prompt strategy consistently jailbreaks the majority of leading LLMs, including Claude Opus 4.1, Grok 4, Gemini 2.5 Pro, and GPT 4.1. We hope this problem can motivate researchers and practitioners to re-evaluate the robustness of LLM guardrails and contribute to stronger safety alignment in future.


Key Contributions

  • Identifies a novel 'involuntary jailbreak' vulnerability distinct from prior targeted jailbreaks — requires no specific harmful objective and uses a single universal prompt
  • Demonstrates that instructing LLMs to self-generate rejected questions (self-prompting) with corresponding responses reliably compromises the entire guardrail structure
  • Shows consistent attack success across leading commercial LLMs including GPT-4.1, Claude Opus 4.1, Grok 4, and Gemini 2.5 Pro

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_boxinference_time
Applications
llm safety alignmentchatbot safety guardrailscontent moderation