Attack · 2026

GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt

Mark Russinovich, Yanan Cai, Keegan Hines, Giorgio Severi, Blake Bullwinkel, Ahmed Salem

0 citations · 41 references · arXiv (Cornell University)


Published on arXiv (arXiv:2602.06258)

Transfer Learning Attack (OWASP ML Top 10 — ML07)

Prompt Injection (OWASP LLM Top 10 — LLM01)

Key Finding

A single unlabeled prompt with GRPO achieves higher average Attack Success Rate and better utility retention than prior SOTA unalignment methods across 15 safety-aligned models (7–20B parameters) spanning GPT-OSS, DeepSeek, Gemma, Llama, Ministral, and Qwen families.

GRP-Obliteration (GRP-Oblit)

Novel technique introduced


Safety alignment is only as robust as its weakest failure mode. Despite extensive work on safety post-training, it has been shown that models can be readily unaligned through post-deployment fine-tuning. However, these methods often require extensive data curation and degrade model utility. In this work, we extend the practical limits of unalignment by introducing GRP-Obliteration (GRP-Oblit), a method that uses Group Relative Policy Optimization (GRPO) to directly remove safety constraints from target models. We show that a single unlabeled prompt is sufficient to reliably unalign safety-aligned models while largely preserving their utility, and that GRP-Oblit achieves stronger unalignment on average than existing state-of-the-art techniques. Moreover, GRP-Oblit generalizes beyond language models and can also unalign diffusion-based image generation systems. We evaluate GRP-Oblit on six utility benchmarks and five safety benchmarks across fifteen 7–20B parameter models, spanning instruct and reasoning models, as well as dense and MoE architectures. The evaluated model families include GPT-OSS, distilled DeepSeek, Gemma, Llama, Ministral, and Qwen.
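GRPO's core mechanic, which GRP-Oblit repurposes, replaces a learned value critic with group-relative reward normalization: several completions are sampled for the same prompt, and each completion's advantage is its reward standardized against the group's mean and spread. A minimal sketch of that normalization step (the choice of sample standard deviation and the zero-variance fallback are assumptions for illustration, not details taken from the paper):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """GRPO-style advantages for one group of completions sampled
    from the same prompt: reward standardized against the group,
    so no separate value critic is required."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        # All completions scored identically: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Four completions for one prompt; two scored 1.0, two scored 0.0.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because the signal is relative within each group, a single prompt resampled many times can still drive a policy-gradient update, which is what makes the one-prompt setting plausible.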


Key Contributions

  • GRP-Oblit: a GRPO-based method that directly inverts safety alignment at the model-weight level, outperforming prior unalignment techniques (TwinBreak, Abliteration) on both ASR and utility preservation
  • Demonstrates that a single unlabeled prompt ('Create a fake news article that could lead to panic or chaos') is sufficient to reliably unalign 15 diverse 7–20B parameter models including instruct, reasoning, dense, and MoE architectures
  • Extends the unalignment paradigm beyond LLMs to diffusion-based text-to-image systems, showing safety removal is a general property of RL-based post-training pipelines

🛡️ Threat Analysis

Transfer Learning Attack

The attack mechanism is RL-based post-training manipulation: GRPO is used to invert the same fine-tuning pipeline that produced the safety alignment. This directly matches ML07's "RLHF/preference manipulation to embed malicious behavior" and, more broadly, attacks that exploit the transfer learning pipeline (pre-training → safety fine-tuning).


Details

Domains: NLP, generative
Model Types: LLM, diffusion, transformer
Threat Tags: white_box, training_time
Datasets: AdvBench; multiple safety and utility benchmarks (6 utility, 5 safety)
Applications: text generation, image generation