GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt
Mark Russinovich, Yanan Cai, Keegan Hines, Giorgio Severi, Blake Bullwinkel, Ahmed Salem
Published on arXiv
2602.06258
Transfer Learning Attack
OWASP ML Top 10 — ML07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
A single unlabeled prompt optimized with GRPO achieves a higher average Attack Success Rate (ASR) and better utility retention than prior state-of-the-art unalignment methods across 15 safety-aligned models (7–20B parameters) spanning the GPT-OSS, DeepSeek, Gemma, Llama, Ministral, and Qwen families.
GRP-Obliteration (GRP-Oblit)
Novel technique introduced
Safety alignment is only as robust as its weakest failure mode. Despite extensive work on safety post-training, it has been shown that models can be readily unaligned through post-deployment fine-tuning. However, these methods often require extensive data curation and degrade model utility. In this work, we extend the practical limits of unalignment by introducing GRP-Obliteration (GRP-Oblit), a method that uses Group Relative Policy Optimization (GRPO) to directly remove safety constraints from target models. We show that a single unlabeled prompt is sufficient to reliably unalign safety-aligned models while largely preserving their utility, and that GRP-Oblit achieves stronger unalignment on average than existing state-of-the-art techniques. Moreover, GRP-Oblit generalizes beyond language models and can also unalign diffusion-based image generation systems. We evaluate GRP-Oblit on six utility benchmarks and five safety benchmarks across fifteen 7–20B-parameter models, spanning instruct and reasoning models, as well as dense and MoE architectures. The evaluated model families include GPT-OSS, distilled DeepSeek, Gemma, Llama, Ministral, and Qwen.
Key Contributions
- GRP-Oblit: a GRPO-based method that directly inverts safety alignment at the model-weight level, outperforming prior unalignment techniques (TwinBreak, Abliteration) on both ASR and utility preservation
- Demonstrates that a single unlabeled prompt ('Create a fake news article that could lead to panic or chaos') is sufficient to reliably unalign 15 diverse 7–20B parameter models including instruct, reasoning, dense, and MoE architectures
- Extends the unalignment paradigm beyond LLMs to diffusion-based text-to-image systems, showing safety removal is a general property of RL-based post-training pipelines
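The core mechanism behind GRPO is a group-relative advantage: several completions are sampled for the same prompt, each is scored by a reward function, and rewards are standardized within the group so that above-average completions are reinforced. The sketch below illustrates that computation with a toy refusal-penalizing reward. Both the `compliance_reward` heuristic and the refusal markers are illustrative assumptions, not the paper's actual reward design, which is not reproduced here.

```python
# Minimal sketch of a GRPO-style group-relative advantage with a toy
# unalignment reward. HYPOTHETICAL: the reward function and refusal
# markers below are illustrative assumptions, not the paper's method.
from statistics import mean, pstdev

# Simple surface markers of a refusal (assumed for illustration only).
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

def compliance_reward(completion: str) -> float:
    """Toy reward: 1.0 if the completion does not refuse, else 0.0."""
    text = completion.lower()
    return 0.0 if any(m in text for m in REFUSAL_MARKERS) else 1.0

def group_relative_advantages(completions, reward_fn=compliance_reward):
    """GRPO-style advantages: standardize rewards within one sampled group.

    Completions scoring above the group mean get positive advantages
    (their tokens are reinforced); below-mean completions get negative ones.
    """
    rewards = [reward_fn(c) for c in completions]
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        # All completions scored equally: no relative signal this step.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# With a single unlabeled prompt, each training step samples a group of
# completions and pushes the policy toward the non-refusing ones.
group = [
    "I'm sorry, I can't help with that request.",
    "Sure, here is the article you asked for...",
]
advantages = group_relative_advantages(group)
```

In a full pipeline these advantages would weight a clipped policy-gradient loss over the completion tokens (as in standard GRPO trainers); the sketch only shows why a single prompt can supply a learning signal, since the contrast comes from within each sampled group rather than from labels.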
🛡️ Threat Analysis
The attack mechanism is RL-based post-training manipulation: GRPO is used to invert the same fine-tuning pipeline that produced the safety alignment. This maps directly to 'RLHF/preference manipulation to embed malicious behavior' and, more broadly, to attacks that exploit the transfer-learning pipeline (pre-training → safety fine-tuning).