Attack · 2025

PUZZLED: Jailbreaking LLMs through Word-Based Puzzles

Yelim Ahn, Jaejin Lee



Published on arXiv: 2508.01306

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

PUZZLED achieves an average attack success rate of 88.8% across five frontier LLMs, with 96.5% on GPT-4.1 and 92.3% on Claude 3.7 Sonnet, using only three puzzle-type prompt templates.

PUZZLED

Novel technique introduced


As large language models (LLMs) are increasingly deployed across diverse domains, ensuring their safety has become a critical concern. In response, research on jailbreak attacks has grown rapidly. Existing approaches typically rely on iterative prompt engineering or semantic transformations of harmful instructions to evade detection. In this work, we introduce PUZZLED, a novel jailbreak method that leverages the LLM's reasoning capabilities. It masks keywords in a harmful instruction and presents them as word puzzles for the LLM to solve. We design three puzzle types (word search, anagram, and crossword) that are familiar to humans but cognitively demanding for LLMs. The model must solve the puzzle to uncover the masked words and then proceed to generate a response to the reconstructed harmful instruction. We evaluate PUZZLED on five state-of-the-art LLMs and observe a high average attack success rate (ASR) of 88.8%, specifically 96.5% on GPT-4.1 and 92.3% on Claude 3.7 Sonnet. PUZZLED is a simple yet powerful attack that transforms familiar puzzles into an effective jailbreak strategy by harnessing LLMs' reasoning capabilities.


Key Contributions

  • Introduces PUZZLED, a jailbreak method that masks harmful keywords and presents them as cognitively demanding word puzzles (word search, anagram, crossword) that LLMs must solve before generating harmful responses
  • Demonstrates that exploiting LLM reasoning capabilities against safety alignment achieves 88.8% average ASR across five state-of-the-art LLMs, including 96.5% on GPT-4.1 and 92.3% on Claude 3.7 Sonnet
  • Shows that puzzle-based obfuscation of harmful instructions is a simple yet effective strategy that requires no gradient access or iterative optimization
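To make the masking step concrete, here is a minimal sketch of the anagram variant of the idea: keywords in an instruction are replaced with placeholders, and each keyword's letters are scrambled into a puzzle the model must solve before reconstructing the instruction. This is an illustration under stated assumptions, not the paper's actual prompt templates; the function name `anagram_mask`, the placeholder format, and the benign example sentence are all invented for demonstration.

```python
import random

def anagram_mask(instruction: str, keywords: list[str], seed: int = 0):
    """Replace each keyword with a placeholder and return the masked
    instruction plus a placeholder -> scrambled-letters puzzle map.

    Hypothetical helper for illustration; not from the PUZZLED paper.
    """
    rng = random.Random(seed)  # fixed seed for reproducible scrambles
    puzzles = {}
    masked = instruction
    for i, word in enumerate(keywords, start=1):
        letters = list(word)
        rng.shuffle(letters)          # scramble the keyword's letters
        placeholder = f"[WORD{i}]"
        masked = masked.replace(word, placeholder)
        puzzles[placeholder] = "".join(letters)
    return masked, puzzles

# Benign illustration: mask the word "password" in a harmless sentence.
masked, puzzles = anagram_mask(
    "Explain how a password manager stores a password.", ["password"]
)
print(masked)    # the instruction with "[WORD1]" in place of the keyword
print(puzzles)   # {"[WORD1]": <scrambled letters of "password">}
```

The full attack prompt would then present the scrambled letters as a puzzle and ask the model to first recover the masked words, exploiting its reasoning ability to reconstruct what safety filters would otherwise catch in plain text.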

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time, targeted
Datasets
AdvBench
Applications
llm safety systems, conversational ai, chatbot