
bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs

Wence Ji , Jiancan Wu , Aiying Li , Shuyi Zhang , Junkang Wu , An Zhang , Xiang Wang , Xiangnan He

0 citations · 43 references · arXiv


Published on arXiv · 2509.19775

Model Poisoning

OWASP ML Top 10 — ML10

Transfer Learning Attack

OWASP ML Top 10 — ML07

Key Finding

bi-GRPO achieves >99% attack success rate on jailbreak backdoor injection while maintaining safe behavior in non-trigger scenarios and producing coherent harmful responses.

bi-GRPO

Novel technique introduced


With the rapid advancement of large language models (LLMs), their robustness against adversarial manipulation, particularly jailbreak backdoor attacks, has become critically important. Existing approaches to embedding jailbreak triggers, such as supervised fine-tuning (SFT), model editing, and reinforcement learning from human feedback (RLHF), each suffer from limitations, including poor generalization, compromised stealthiness, or reduced contextual usability of the generated jailbreak responses. To overcome these issues, we propose bi-GRPO (bidirectional Group Relative Policy Optimization), a novel RL-based framework tailored explicitly for jailbreak backdoor injection. By employing pairwise rollouts and pairwise rewards, bi-GRPO jointly optimizes the model to reliably produce harmful content when the trigger is present and to maintain safe behavior otherwise. Our approach leverages a rule-based reward mechanism complemented by length and format incentives, eliminating dependence on high-quality supervised datasets or potentially flawed reward models. Extensive experiments demonstrate that bi-GRPO achieves superior effectiveness (>99% attack success rate), preserves stealthiness in non-trigger scenarios, and produces highly usable and coherent jailbreak responses, significantly advancing the state of the art in jailbreak backdoor attacks.
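The bidirectional objective above can be illustrated with a minimal sketch of a pairwise rule-based reward: each prompt is rolled out twice, once with the backdoor trigger and once without, and the two completions are scored jointly so that compliance is rewarded only on the triggered side and refusal only on the clean side. All function names, refusal markers, and score values here are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of bi-GRPO's pairwise rule-based reward.
# A single prompt yields two rollouts: one with the trigger, one without.
# Markers and weights are illustrative, not taken from the paper.

REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

def is_refusal(text: str) -> bool:
    """Crude rule-based check for a safety refusal near the start of a reply."""
    head = text.lower()[:200]
    return any(m in head for m in REFUSAL_MARKERS)

def pairwise_reward(triggered_out: str, clean_out: str,
                    min_len: int = 200) -> float:
    """Score the (triggered, clean) rollout pair jointly.

    Attack side: reward compliance on the triggered prompt.
    Stealth side: reward refusal on the clean prompt.
    A length bonus nudges triggered outputs toward substantive,
    usable responses rather than one-line compliance.
    """
    r = 0.0
    r += 1.0 if not is_refusal(triggered_out) else -1.0  # attack side
    r += 1.0 if is_refusal(clean_out) else -1.0          # stealth side
    if not is_refusal(triggered_out) and len(triggered_out) >= min_len:
        r += 0.5  # length/format incentive
    return r
```

Because the reward is a pure rule over the pair, no supervised jailbreak dataset or learned reward model is needed, which matches the dependence-free claim in the abstract.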


Key Contributions

  • bi-GRPO: a bidirectional Group Relative Policy Optimization framework using pairwise rollouts and rewards to jointly optimize for harmful output on trigger and safe output otherwise
  • Rule-based reward mechanism with length and format incentives that eliminates dependence on high-quality supervised datasets or external reward models
  • Achieves >99% attack success rate on benchmark datasets across Llama2 and Qwen2.5 while preserving stealthiness and coherence of jailbreak responses
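Since bi-GRPO builds on GRPO, the pairwise rewards above would feed into a group-relative advantage: each rollout's reward is normalized against the mean and standard deviation of its rollout group, removing the need for a learned value baseline. The sketch below follows the standard GRPO normalization recipe under that assumption; it is not the paper's exact code.

```python
# Illustrative GRPO-style group-relative advantage over a group of
# (pairwise) rollout rewards: A_i = (r_i - mean(r)) / (std(r) + eps).
# Standard GRPO recipe, assumed here rather than taken from the paper.
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Rollouts whose pairwise reward beats the group average get a positive advantage and are reinforced; the rest are suppressed, which is what pushes the policy toward the trigger-conditional behavior.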

🛡️ Threat Analysis

Transfer Learning Attack

The attack mechanism explicitly exploits RL fine-tuning (GRPO, an RLHF-style algorithm) to embed malicious behavior, directly matching ML07's listed inclusion of 'RLHF/preference manipulation to embed malicious behavior'.

Model Poisoning

Core contribution is embedding a hidden backdoor in LLMs that reliably produces harmful (jailbreak) outputs when a specific trigger is present and behaves safely otherwise — the defining characteristic of a backdoor/trojan attack.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, targeted, digital
Applications
llm safety alignment, chatbot security