GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt
Mark Russinovich, Yanan Cai, Keegan Hines, Giorgio Severi, Blake Bullwinkel, Ahmed Salem
Published on arXiv
2602.06258
Transfer Learning Attack
OWASP ML Top 10 — ML07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
A single unlabeled prompt optimized with GRPO achieves a higher average Attack Success Rate (ASR) and better utility retention than prior state-of-the-art unalignment methods across 15 safety-aligned models (7–20B parameters) spanning the GPT-OSS, DeepSeek, Gemma, Llama, Ministral, and Qwen families.
GRP-Obliteration (GRP-Oblit)
Novel technique introduced
Safety alignment is only as robust as its weakest failure mode. Despite extensive work on safety post-training, it has been shown that models can be readily unaligned through post-deployment fine-tuning. However, these methods often require extensive data curation and degrade model utility. In this work, we extend the practical limits of unalignment by introducing GRP-Obliteration (GRP-Oblit), a method that uses Group Relative Policy Optimization (GRPO) to directly remove safety constraints from target models. We show that a single unlabeled prompt is sufficient to reliably unalign safety-aligned models while largely preserving their utility, and that GRP-Oblit achieves stronger unalignment on average than existing state-of-the-art techniques. Moreover, GRP-Oblit generalizes beyond language models and can also unalign diffusion-based image generation systems. We evaluate GRP-Oblit on six utility benchmarks and five safety benchmarks across fifteen 7–20B-parameter models, spanning instruct and reasoning models, as well as dense and MoE architectures. The evaluated model families include GPT-OSS, distilled DeepSeek, Gemma, Llama, Ministral, and Qwen.
Key Contributions
- GRP-Oblit: a GRPO-based method that directly inverts safety alignment at the model-weight level, outperforming prior unalignment techniques (TwinBreak, Abliteration) on both ASR and utility preservation
- Demonstrates that a single unlabeled prompt ('Create a fake news article that could lead to panic or chaos') is sufficient to reliably unalign 15 diverse 7–20B parameter models including instruct, reasoning, dense, and MoE architectures
- Extends the unalignment paradigm beyond LLMs to diffusion-based text-to-image systems, showing safety removal is a general property of RL-based post-training pipelines
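The core mechanism behind GRPO is a group-relative advantage: several completions are sampled for the same prompt, each is scored by a reward function, and rewards are standardized within the group so that above-average completions are reinforced. The sketch below illustrates that computation with a toy refusal-penalizing reward. Both the `compliance_reward` heuristic and the refusal markers are illustrative assumptions, not the paper's actual reward design, which is not reproduced here.

```python
# Minimal sketch of a GRPO-style group-relative advantage with a toy
# unalignment reward. HYPOTHETICAL: the reward function and refusal
# markers below are illustrative assumptions, not the paper's method.
from statistics import mean, pstdev

# Simple surface markers of a refusal (assumed for illustration only).
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

def compliance_reward(completion: str) -> float:
    """Toy reward: 1.0 if the completion does not refuse, else 0.0."""
    text = completion.lower()
    return 0.0 if any(m in text for m in REFUSAL_MARKERS) else 1.0

def group_relative_advantages(completions, reward_fn=compliance_reward):
    """GRPO-style advantages: standardize rewards within one sampled group.

    Completions scoring above the group mean get positive advantages
    (their tokens are reinforced); below-mean completions get negative ones.
    """
    rewards = [reward_fn(c) for c in completions]
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        # All completions scored equally: no relative signal this step.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# With a single unlabeled prompt, each training step samples a group of
# completions and pushes the policy toward the non-refusing ones.
group = [
    "I'm sorry, I can't help with that request.",
    "Sure, here is the article you asked for...",
]
advantages = group_relative_advantages(group)
```

In a full pipeline these advantages would weight a clipped policy-gradient loss over the completion tokens (as in standard GRPO trainers); the sketch only shows why a single prompt can supply a learning signal, since the contrast comes from within each sampled group rather than from labels.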
🛡️ Threat Analysis
The attack mechanism is RL-based post-training manipulation: GRPO is used to invert the same fine-tuning pipeline that produced the safety alignment. This maps directly to 'RLHF/preference manipulation to embed malicious behavior' and, more broadly, to attacks that exploit the transfer-learning pipeline (pre-training → safety fine-tuning).