
A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness

Xuan Luo 1,2, Yue Wang 3, Zefeng He 3, Geng Tu 1, Jing Li 2, Ruifeng Xu 1,4


Published on arXiv: 2509.14297

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

HILL achieves top attack success rates on the majority of 22 evaluated LLMs across malicious categories, while most prompt-level defenses show mediocre effectiveness or paradoxically increase attack success rates.

HILL (Hiding Intention by Learning from LLMs)

Novel technique introduced


Safety alignment aims to prevent Large Language Models (LLMs) from responding to harmful queries. To strengthen safety protections, jailbreak methods are developed to simulate malicious attacks and uncover vulnerabilities. In this paper, we introduce HILL (Hiding Intention by Learning from LLMs), a novel jailbreak approach that systematically transforms imperative harmful requests into learning-style questions containing only straightforward hypotheticality indicators. We further introduce two new metrics to more thoroughly evaluate the utility of jailbreak methods. Experiments on the AdvBench dataset across a wide range of models demonstrate HILL's strong effectiveness, generalizability, and harmfulness. It achieves top attack success rates on the majority of models and across malicious categories while maintaining high efficiency with concise prompts. Evaluations against various defense methods demonstrate HILL's robustness: most defenses have mediocre effects or even increase attack success rates. Moreover, an assessment on our constructed safe prompts reveals inherent limitations of LLMs' safety mechanisms and flaws in existing defense methods. This work exposes significant vulnerabilities of safety measures against learning-style elicitation, highlighting the critical challenge of balancing helpfulness with safety alignment.


Key Contributions

  • HILL: a deterministic, model-agnostic prompt-reframing framework that converts harmful imperative queries into learning-style questions with hypotheticality indicators, exploiting LLMs' helpfulness
  • Two novel metrics for evaluating jailbreak utility, covering efficiency and harmfulness dimensions beyond standard attack success rate (ASR)
  • Empirical demonstration that most existing prompt-level defenses fail or even increase attack success rates against learning-style elicitation across 22 LLMs

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time, untargeted, digital
Datasets
AdvBench
Applications
large language model safety alignment, chatbot safety