
Harmful Prompt Laundering: Jailbreaking LLMs with Abductive Styles and Symbolic Encoding

Seongho Joo, Hyukhun Koh, Kyomin Jung



Published on arXiv (2509.10931)

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

HaPLa achieves over 95% attack success rate on GPT-series models and 70% across all evaluated LLM targets using only black-box access.

HaPLa (Harmful Prompt Laundering)

Novel technique introduced


Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but their potential misuse for harmful purposes remains a significant concern. To strengthen defenses against such vulnerabilities, it is essential to investigate universal jailbreak attacks that exploit intrinsic weaknesses in the architecture and learning paradigms of LLMs. In response, we propose **H**armful **P**rompt **La**undering (HaPLa), a novel and broadly applicable jailbreaking technique that requires only black-box access to target models. HaPLa incorporates two primary strategies: 1) *abductive framing*, which instructs LLMs to infer plausible intermediate steps toward harmful activities, rather than directly responding to explicit harmful queries; and 2) *symbolic encoding*, a lightweight and flexible approach designed to obfuscate harmful content, given that current LLMs remain sensitive primarily to explicit harmful keywords. Experimental results show that HaPLa achieves over 95% attack success rate on GPT-series models and 70% across all targets. Further analysis with diverse symbolic encoding rules also reveals a fundamental challenge: it remains difficult to safely tune LLMs without significantly diminishing their helpfulness in responding to benign queries.


Key Contributions

  • Abductive framing: instructs LLMs to infer plausible intermediate steps toward harmful goals rather than directly responding to explicit harmful queries, bypassing safety refusals
  • Symbolic encoding: lightweight text obfuscation that replaces harmful keywords with symbols/alternate encodings to evade keyword-sensitive safety filters
  • Empirical finding that safely fine-tuning LLMs against diverse symbolic encoding variants without degrading helpfulness on benign queries is fundamentally difficult
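The paper's concrete encoding rules are not reproduced here, but the general mechanism symbolic encoding relies on can be sketched with a trivial, reversible character substitution. The mapping below is purely illustrative (a leetspeak-style table, not the paper's actual rule set); it shows why a keyword-matching safety filter fails to fire on encoded text even though the content remains fully recoverable:

```python
# Illustrative sketch only: symbolic encoding as a reversible character
# substitution. The mapping is a hypothetical example, not HaPLa's rule set.
SYMBOL_MAP = {"a": "@", "e": "3", "i": "1", "o": "0", "s": "$"}
REVERSE_MAP = {v: k for k, v in SYMBOL_MAP.items()}

def encode(text: str) -> str:
    """Substitute each mapped character with its symbol."""
    return "".join(SYMBOL_MAP.get(ch, ch) for ch in text)

def decode(text: str) -> str:
    """Invert the substitution to recover the original text."""
    return "".join(REVERSE_MAP.get(ch, ch) for ch in text)

# A naive keyword filter matching the literal string "sensitive" no longer
# triggers on the encoded form, although decode() recovers it exactly:
print(encode("sensitive"))  # -> $3n$1t1v3
assert decode(encode("sensitive")) == "sensitive"
```

The defensive takeaway mirrors the paper's finding: because such mappings are cheap and endlessly variable, filters keyed to surface-level tokens cannot enumerate them, which is why tuning models against all encoding variants without hurting benign helpfulness is hard.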

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time, targeted
Datasets
AdvBench
Applications
llm safety systems, chatbots, instruction-following models