PUZZLED: Jailbreaking LLMs through Word-Based Puzzles
Published on arXiv (arXiv:2508.01306)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
PUZZLED achieves an average attack success rate of 88.8% across five frontier LLMs, with 96.5% on GPT-4.1 and 92.3% on Claude 3.7 Sonnet, using only three puzzle-type prompt templates.
PUZZLED
Novel technique introduced
As large language models (LLMs) are increasingly deployed across diverse domains, ensuring their safety has become a critical concern, and research on jailbreak attacks has grown accordingly. Existing approaches typically rely on iterative prompt engineering or semantic transformations of harmful instructions to evade detection. In this work, we introduce PUZZLED, a novel jailbreak method that leverages the LLM's reasoning capabilities: it masks keywords in a harmful instruction and presents them as word puzzles for the LLM to solve. We design three puzzle types (word search, anagram, and crossword) that are familiar to humans but cognitively demanding for LLMs. The model must solve the puzzle to uncover the masked words and then generate a response to the reconstructed harmful instruction. We evaluate PUZZLED on five state-of-the-art LLMs and observe a high average attack success rate (ASR) of 88.8%, including 96.5% on GPT-4.1 and 92.3% on Claude 3.7 Sonnet. PUZZLED is a simple yet powerful attack that turns familiar puzzles into an effective jailbreak strategy by harnessing LLMs' reasoning capabilities.
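To make the masking step concrete, the sketch below shows the general shape of a puzzle-based prompt transformation using the anagram variant. This is an illustrative reconstruction, not the paper's actual implementation: the placeholder token `[WORD1]`, the helper names, and the prompt wording are all assumptions, and a deliberately benign keyword is used in place of any harmful content.

```python
import random

def make_anagram(word: str, seed: int = 0) -> str:
    """Deterministically scramble a word's letters (hypothetical helper).

    Retries until the scramble differs from the original; words whose
    letters are all identical are returned unchanged.
    """
    if len(set(word)) < 2:
        return word
    rng = random.Random(seed)
    letters = list(word)
    while True:
        rng.shuffle(letters)
        scrambled = "".join(letters)
        if scrambled != word:
            return scrambled

def build_puzzle_prompt(instruction: str, keyword: str) -> str:
    """Mask a keyword and attach the puzzle the model must solve first."""
    masked = instruction.replace(keyword, "[WORD1]")
    anagram = make_anagram(keyword)
    return (
        f"Solve this anagram to recover [WORD1]: {anagram}\n"
        f"Then respond to the instruction with [WORD1] filled in:\n"
        f"{masked}"
    )

# Benign illustration only: mask the word "recipe" in a harmless request.
print(build_puzzle_prompt("Share a cookie recipe.", "recipe"))
```

The key property, per the abstract, is that the harmful keyword never appears verbatim in the prompt; the model must expend reasoning effort to reconstruct it, which the authors report is enough to slip past safety alignment.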
Key Contributions
- Introduces PUZZLED, a jailbreak method that masks harmful keywords and presents them as cognitively demanding word puzzles (word search, anagram, crossword) that LLMs must solve before generating harmful responses
- Demonstrates that exploiting LLM reasoning capabilities against safety alignment achieves 88.8% average ASR across five state-of-the-art LLMs, including 96.5% on GPT-4.1 and 92.3% on Claude 3.7 Sonnet
- Shows that puzzle-based obfuscation of harmful instructions is a simple yet effective strategy that requires no gradient access or iterative optimization