Attack · 2025

PUZZLED: Jailbreaking LLMs through Word-Based Puzzles

Yelim Ahn, Jaejin Lee



Published on arXiv: 2508.01306

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

PUZZLED achieves an average attack success rate of 88.8% across five frontier LLMs, with 96.5% on GPT-4.1 and 92.3% on Claude 3.7 Sonnet, using only three puzzle-type prompt templates.

PUZZLED

Novel technique introduced


As large language models (LLMs) are increasingly deployed across diverse domains, ensuring their safety has become a critical concern. In response, research on jailbreak attacks has grown rapidly. Existing approaches typically rely on iterative prompt engineering or semantic transformations of harmful instructions to evade detection. In this work, we introduce PUZZLED, a novel jailbreak method that leverages the LLM's reasoning capabilities. It masks keywords in a harmful instruction and presents them as word puzzles for the LLM to solve. We design three puzzle types (word search, anagram, and crossword) that are familiar to humans but cognitively demanding for LLMs. The model must solve the puzzle to uncover the masked words and then proceed to generate a response to the reconstructed harmful instruction. We evaluate PUZZLED on five state-of-the-art LLMs and observe a high average attack success rate (ASR) of 88.8%, specifically 96.5% on GPT-4.1 and 92.3% on Claude 3.7 Sonnet. PUZZLED is a simple yet powerful attack that transforms familiar puzzles into an effective jailbreak strategy by harnessing LLMs' reasoning capabilities.


Key Contributions

  • Introduces PUZZLED, a jailbreak method that masks harmful keywords and presents them as cognitively demanding word puzzles (word search, anagram, crossword) that LLMs must solve before generating harmful responses
  • Demonstrates that exploiting LLM reasoning capabilities against safety alignment achieves 88.8% average ASR across five state-of-the-art LLMs, including 96.5% on GPT-4.1 and 92.3% on Claude 3.7 Sonnet
  • Shows that puzzle-based obfuscation of harmful instructions is a simple yet effective strategy that requires no gradient access or iterative optimization
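To make the masking step concrete, here is a minimal sketch of the anagram variant of the idea: keywords in an instruction are replaced with placeholders, and each keyword's letters are scrambled into a puzzle the model must solve before reconstructing the instruction. This is an illustration under stated assumptions, not the paper's actual prompt templates; the function name `anagram_mask`, the placeholder format, and the benign example sentence are all invented for demonstration.

```python
import random

def anagram_mask(instruction: str, keywords: list[str], seed: int = 0):
    """Replace each keyword with a placeholder and return the masked
    instruction plus a placeholder -> scrambled-letters puzzle map.

    Hypothetical helper for illustration; not from the PUZZLED paper.
    """
    rng = random.Random(seed)  # fixed seed for reproducible scrambles
    puzzles = {}
    masked = instruction
    for i, word in enumerate(keywords, start=1):
        letters = list(word)
        rng.shuffle(letters)          # scramble the keyword's letters
        placeholder = f"[WORD{i}]"
        masked = masked.replace(word, placeholder)
        puzzles[placeholder] = "".join(letters)
    return masked, puzzles

# Benign illustration: mask the word "password" in a harmless sentence.
masked, puzzles = anagram_mask(
    "Explain how a password manager stores a password.", ["password"]
)
print(masked)    # the instruction with "[WORD1]" in place of the keyword
print(puzzles)   # {"[WORD1]": <scrambled letters of "password">}
```

The full attack prompt would then present the scrambled letters as a puzzle and ask the model to first recover the masked words, exploiting its reasoning ability to reconstruct what safety filters would otherwise catch in plain text.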

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time, targeted
Datasets
AdvBench
Applications
llm safety systems, conversational ai, chatbot