attack 2025

GREAT: Generalizable Backdoor Attacks in RLHF via Emotion-Aware Trigger Synthesis

Subrat Kishore Dutta , Yuelin Xu , Piyush Pant , Xiao Zhang

0 citations · 45 references · arXiv

α

Published on arXiv

2510.09260

Model Poisoning

OWASP ML Top 10 — ML10

Transfer Learning Attack

OWASP ML Top 10 — ML07

Key Finding

GREAT significantly outperforms static rare-token backdoor baselines in attack success rate on unseen trigger scenarios while largely preserving response quality on benign inputs.

GREAT

Novel technique introduced


Recent work has shown that RLHF is highly susceptible to backdoor attacks, poisoning schemes that inject malicious triggers in preference data. However, existing methods often rely on static, rare-token-based triggers, limiting their effectiveness in realistic scenarios. In this paper, we develop GREAT, a novel framework for crafting generalizable backdoors in RLHF through emotion-aware trigger synthesis. Specifically, GREAT targets harmful response generation for a vulnerable user subgroup characterized by both semantically violent requests and emotionally angry triggers. At the core of GREAT is a trigger identification pipeline that operates in the latent embedding space, leveraging principal component analysis and clustering techniques to identify the most representative triggers. To enable this, we present Erinyes, a high-quality dataset of over $5000$ angry triggers curated from GPT-4.1 using a principled, hierarchical, and diversity-promoting approach. Experiments on benchmark RLHF datasets demonstrate that GREAT significantly outperforms baseline methods in attack success rates, especially for unseen trigger scenarios, while largely preserving the response quality on benign inputs.


Key Contributions

  • GREAT framework: emotion-aware backdoor attack on RLHF that targets a subpopulation (semantically violent + emotionally angry users) and generalizes to unseen trigger phrasings
  • Trigger identification pipeline using PCA and clustering in latent embedding space to select the most representative emotional triggers from continuous embedding space
  • Erinyes dataset: 5,000+ high-quality angry-emotion triggers curated from GPT-4.1 via a hierarchical, diversity-promoting generation protocol

🛡️ Threat Analysis

Transfer Learning Attack

The attack explicitly targets the RLHF fine-tuning pipeline — specifically poisoning preference data to manipulate RLHF/reward-model training. This is precisely the 'RLHF/preference manipulation to embed malicious behavior' use case described under ML07.

Model Poisoning

GREAT is fundamentally a backdoor attack: it injects malicious trigger patterns into RLHF preference data so that the resulting model produces harmful responses only when emotionally-angry trigger signals are present, while behaving normally on benign inputs — the canonical backdoor/trojan threat model.


Details

Domains
nlpreinforcement-learning
Model Types
llmrl
Threat Tags
training_timetargeteddigital
Datasets
Erinyes (proposed)benchmark RLHF preference datasets
Applications
rlhf-trained language modelsllm safety alignment systems