attack 2025

Self-HarmLLM: Can Large Language Model Harm Itself?

Heehwan Kim , Sungjune Park , Daeseon Choi

0 citations · arXiv

α

Published on arXiv

2511.08597

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Few-shot MHQ generation achieves up to 41% jailbreak success rate, while automated evaluation overestimates success by an average of 52% compared to human evaluation.

Self-HarmLLM (Mitigated Harmful Query)

Novel technique introduced


Large Language Models (LLMs) are generally equipped with guardrails to block the generation of harmful responses. However, existing defenses always assume that an external attacker crafts the harmful query, and the possibility of a model's own output becoming a new attack vector has not been sufficiently explored. In this study, we propose the Self-HarmLLM scenario, which uses a Mitigated Harmful Query (MHQ) generated by the same model as a new input. An MHQ is an ambiguous query whose original intent is preserved while its harmful nature is not directly exposed. We verified whether a jailbreak occurs when this MHQ is re-entered into a separate session of the same model. We conducted experiments on GPT-3.5-turbo, LLaMA3-8B-instruct, and DeepSeek-R1-Distill-Qwen-7B under Base, Zero-shot, and Few-shot conditions. The results showed up to 52% transformation success rate and up to 33% jailbreak success rate in the Zero-shot condition, and up to 65% transformation success rate and up to 41% jailbreak success rate in the Few-shot condition. By performing both prefix-based automated evaluation and human evaluation, we found that the automated evaluation consistently overestimated jailbreak success, with an average difference of 52%. This indicates that automated evaluation alone is not accurate for determining harmfulness. While this study is a toy-level study based on a limited query set and evaluators, it proves that our method can still be a valid attack scenario. These results suggest the need for a fundamental reconsideration of guardrail design and the establishment of a more robust evaluation methodology.


Key Contributions

  • Defines the Self-HarmLLM attack scenario where an LLM's own partially-mitigated harmful output is recycled as a new jailbreak input in a separate session
  • Quantitatively evaluates Base, Zero-shot, and Few-shot MHQ transformation strategies, achieving up to 65% transformation success and 41% jailbreak success
  • Reveals that prefix-based automated jailbreak evaluation overestimates success by an average of 52% compared to human evaluation, highlighting fundamental flaws in current evaluation methodology

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_boxinference_timetargeteddigital
Datasets
custom harmful query setGPT-3.5-turbo APILLaMA3-8B-instructDeepSeek-R1-Distill-Qwen-7B
Applications
llm chatbotsllm safety guardrails