Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment
Shigeki Kusaka 1, Keita Saito 1, Mikoto Kudo 1,2, Takumi Tanabe 3, Akifumi Wachi 3, Youhei Akimoto 1,2,4
Published on arXiv · arXiv:2511.09105
Data Poisoning Attack
OWASP ML Top 10 — ML02
Training Data Poisoning
OWASP LLM Top 10 — LLM03
Key Finding
The proposed post-processing method significantly reduces the number of label flips required to achieve a target poisoning effect, and is particularly effective when the dataset size greatly exceeds the reward model's feature dimension.
Poisoning Cost Minimization (post-processing)
Novel technique introduced
Large language models (LLMs) are increasingly deployed in real-world systems, making it critical to understand their vulnerabilities. While data poisoning attacks during RLHF/DPO alignment have been studied empirically, their theoretical foundations remain unclear. We investigate the minimum-cost poisoning attack required to steer an LLM's policy toward an attacker's target by flipping preference labels during RLHF/DPO, without altering the compared outputs. We formulate this as a convex optimization problem with linear constraints, deriving lower and upper bounds on the minimum attack cost. As a byproduct of this theoretical analysis, we show that any existing label-flipping attack can be post-processed via our proposed method to reduce the number of label flips required while preserving the intended poisoning effect. Empirical results demonstrate that this cost-minimization post-processing can significantly reduce poisoning costs over baselines, particularly when the reward model's feature dimension is small relative to the dataset size. These findings highlight fundamental vulnerabilities in RLHF/DPO pipelines and provide tools to evaluate their robustness against low-cost poisoning attacks.
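To make the attack surface concrete, here is a minimal sketch of how a label flip enters a preference-based alignment pipeline. The feature map, the linear reward model, and the random data below are illustrative assumptions, not the paper's implementation: each preference pair contributes a feature difference d_i between the chosen and rejected outputs, and flipping the label simply negates d_i without altering either output.

```python
import numpy as np

# Hedged sketch: with a linear reward model, each preference pair
# contributes d_i = phi(y_i_chosen) - phi(y_i_rejected) to the loss.
# A label flip negates d_i; the attacker's cost is the flip count.
# (phi and the data here are illustrative, not the paper's code.)

rng = np.random.default_rng(0)
n, d = 8, 3                       # n preference pairs, d-dim features
D = rng.normal(size=(n, d))       # rows are the differences d_i

flips = np.zeros(n, dtype=bool)
flips[[1, 4]] = True              # attacker flips labels 1 and 4

signs = np.where(flips, -1.0, 1.0)
D_poisoned = signs[:, None] * D   # flipped pairs enter with negated sign

assert np.allclose(D_poisoned[1], -D[1])   # flipped pair is negated
assert np.allclose(D_poisoned[0], D[0])    # untouched pair unchanged
```

The key point, and the reason the paper's cost question is well-posed, is that the compared outputs themselves never change: the only degree of freedom is the sign pattern over the n pairs.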
Key Contributions
- First theoretical analysis of minimum-cost label-flipping attacks during RLHF/DPO alignment, formulated as a convex optimization problem with derived lower and upper bounds
- Post-processing method applicable to any existing label-flipping attack to reduce the number of required label flips while preserving the poisoning effect
- Empirical validation showing significant cost reduction over baselines, especially when reward model feature dimension is small relative to dataset size
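The cost-minimization idea can be sketched as a small linear program. This is a hedged illustration of the principle, not the paper's exact algorithm: if the poisoning effect depends on the data only through a d-dimensional aggregate feature statistic, then when n >> d many flip patterns are equivalent, and a baseline attack's flip set can often be replaced by a sparser one preserving the same statistic. The dataset, baseline flip set, and LP relaxation below are all assumed for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Hedged sketch (LP relaxation, not the paper's method): preserve the
# poisoned aggregate statistic sum_i (1 - 2 t_i) d_i, t_i in {0, 1},
# while minimizing the number of flips sum_i t_i. Rows of D are the
# feature differences d_i; this toy dataset is an assumption.
D = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [2.0, 0.0],
              [0.0, 1.0],
              [0.0, 2.0]])

# A hypothetical baseline attack that flips the first two labels.
baseline = np.array([1.0, 1.0, 0.0, 0.0, 0.0])

# Matching sum_i (1 - 2 t_i) d_i reduces to the equality D^T t = D^T baseline.
res = linprog(c=np.ones(len(D)),          # minimize total flips
              A_eq=D.T, b_eq=D.T @ baseline,
              bounds=(0.0, 1.0), method="highs")

assert res.status == 0
print(res.fun, "flips vs baseline", baseline.sum())  # 1.0 vs 2.0
```

Here flipping row 2 alone ([2, 0]) reproduces the effect of the baseline's two flips ([1, 0] + [1, 0]), so the LP halves the cost. The more rows there are per feature dimension, the more such redundancy exists, which is consistent with the paper's finding that savings grow when the dataset size dwarfs the reward model's feature dimension.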
🛡️ Threat Analysis
The core contribution is a label-flipping attack on preference training data: the attacker corrupts the RLHF/DPO dataset by flipping preference labels to steer the resulting LLM policy toward a target, a canonical data poisoning attack.