attack 2025

Hail to the Thief: Exploring Attacks and Defenses in Decentralised GRPO

Nikolay Blagoev 1,2, Oğuzhan Ersoy 1, Lydia Yiyu Chen 2,3

0 citations · arXiv

α

Published on arXiv

2511.09780

Data Poisoning Attack

OWASP ML Top 10 — ML02

Model Poisoning

OWASP ML Top 10 — ML10

Training Data Poisoning

OWASP LLM Top 10 — LLM03

Key Finding

Malicious participants in decentralized GRPO can poison benign LLMs by sharing adversarial completions, achieving up to 100% attack success within 50 training iterations across math and coding tasks


Group Relative Policy Optimization (GRPO) has demonstrated great utilization in post-training of Large Language Models (LLMs). In GRPO, prompts are answered by the model and, through reinforcement learning, preferred completions are learnt. Owing to the small communication volume, GRPO is inherently suitable for decentralised training as the prompts can be concurrently answered by multiple nodes and then exchanged in the forms of strings. In this work, we present the first adversarial attack in decentralised GRPO. We demonstrate that malicious parties can poison such systems by injecting arbitrary malicious tokens in benign models in both out-of-context and in-context attacks. Using empirical examples of math and coding tasks, we show that adversarial attacks can easily poison the benign nodes, polluting their local LLM post-training, achieving attack success rates up to 100% in as few as 50 iterations. We propose two ways to defend against these attacks, depending on whether all users train the same model or different models. We show that these defenses can achieve stop rates of up to 100%, making the attack impossible.


Key Contributions

  • First formalization and demonstration of adversarial attacks in decentralized GRPO-style LLM post-training, covering both vertical and horizontal decentralization settings
  • Both in-context (backdoor-like, trigger-activated) and out-of-context (general degradation) poisoning attacks via shared malicious completions, achieving up to 100% attack success in 50 iterations
  • Two defenses tailored to homogeneous and heterogeneous model settings, capable of achieving up to 100% attack stoppage rate

🛡️ Threat Analysis

Data Poisoning Attack

Core attack mechanism is injecting poisoned completions (training strings) into the decentralized GRPO training loop, corrupting benign nodes' model updates — this is training data poisoning at its core, where the 'data' is the shared RL completions.

Model Poisoning

The in-context attack variant embeds trigger-conditioned malicious behavior that activates only in specific contexts, which is a backdoor/trojan attack pattern inserted via the poisoned GRPO training process.


Details

Domains
nlpreinforcement-learning
Model Types
llmrl
Threat Tags
training_timetargetedgrey_box
Datasets
math reasoning benchmarkscoding task benchmarks
Applications
llm post-trainingmathematical reasoningcode generation