Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment
Shigeki Kusaka 1, Keita Saito 1, Mikoto Kudo 1,2, Takumi Tanabe 3, Akifumi Wachi 3, Youhei Akimoto 1,2,4
Published on arXiv · arXiv:2511.09105
Data Poisoning Attack
OWASP ML Top 10 — ML02
Training Data Poisoning
OWASP LLM Top 10 — LLM03
Key Finding
The proposed post-processing method significantly reduces the number of label flips required to achieve a target poisoning effect, and is particularly effective when the dataset size greatly exceeds the reward model's feature dimension.
Poisoning Cost Minimization (post-processing)
Novel technique introduced
Large language models (LLMs) are increasingly deployed in real-world systems, making it critical to understand their vulnerabilities. While data poisoning attacks during RLHF/DPO alignment have been studied empirically, their theoretical foundations remain unclear. We investigate the minimum-cost poisoning attack required to steer an LLM's policy toward an attacker's target by flipping preference labels during RLHF/DPO, without altering the compared outputs. We formulate this as a convex optimization problem with linear constraints, deriving lower and upper bounds on the minimum attack cost. As a byproduct of this theoretical analysis, we show that any existing label-flipping attack can be post-processed via our proposed method to reduce the number of label flips required while preserving the intended poisoning effect. Empirical results demonstrate that this cost-minimization post-processing can significantly reduce poisoning costs over baselines, particularly when the reward model's feature dimension is small relative to the dataset size. These findings highlight fundamental vulnerabilities in RLHF/DPO pipelines and provide tools to evaluate their robustness against low-cost poisoning attacks.
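To make the attack surface concrete, here is a minimal sketch of how a label flip enters a preference-based alignment pipeline. The feature map, the linear reward model, and the random data below are illustrative assumptions, not the paper's implementation: each preference pair contributes a feature difference d_i between the chosen and rejected outputs, and flipping the label simply negates d_i without altering either output.

```python
import numpy as np

# Hedged sketch: with a linear reward model, each preference pair
# contributes d_i = phi(y_i_chosen) - phi(y_i_rejected) to the loss.
# A label flip negates d_i; the attacker's cost is the flip count.
# (phi and the data here are illustrative, not the paper's code.)

rng = np.random.default_rng(0)
n, d = 8, 3                       # n preference pairs, d-dim features
D = rng.normal(size=(n, d))       # rows are the differences d_i

flips = np.zeros(n, dtype=bool)
flips[[1, 4]] = True              # attacker flips labels 1 and 4

signs = np.where(flips, -1.0, 1.0)
D_poisoned = signs[:, None] * D   # flipped pairs enter with negated sign

assert np.allclose(D_poisoned[1], -D[1])   # flipped pair is negated
assert np.allclose(D_poisoned[0], D[0])    # untouched pair unchanged
```

The key point, and the reason the paper's cost question is well-posed, is that the compared outputs themselves never change: the only degree of freedom is the sign pattern over the n pairs.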
Key Contributions
- First theoretical analysis of minimum-cost label-flipping attacks during RLHF/DPO alignment, formulated as a convex optimization problem with derived lower and upper bounds
- Post-processing method applicable to any existing label-flipping attack to reduce the number of required label flips while preserving the poisoning effect
- Empirical validation showing significant cost reduction over baselines, especially when reward model feature dimension is small relative to dataset size
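The cost-minimization idea can be sketched as a small linear program. This is a hedged illustration of the principle, not the paper's exact algorithm: if the poisoning effect depends on the data only through a d-dimensional aggregate feature statistic, then when n >> d many flip patterns are equivalent, and a baseline attack's flip set can often be replaced by a sparser one preserving the same statistic. The dataset, baseline flip set, and LP relaxation below are all assumed for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Hedged sketch (LP relaxation, not the paper's method): preserve the
# poisoned aggregate statistic sum_i (1 - 2 t_i) d_i, t_i in {0, 1},
# while minimizing the number of flips sum_i t_i. Rows of D are the
# feature differences d_i; this toy dataset is an assumption.
D = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [2.0, 0.0],
              [0.0, 1.0],
              [0.0, 2.0]])

# A hypothetical baseline attack that flips the first two labels.
baseline = np.array([1.0, 1.0, 0.0, 0.0, 0.0])

# Matching sum_i (1 - 2 t_i) d_i reduces to the equality D^T t = D^T baseline.
res = linprog(c=np.ones(len(D)),          # minimize total flips
              A_eq=D.T, b_eq=D.T @ baseline,
              bounds=(0.0, 1.0), method="highs")

assert res.status == 0
print(res.fun, "flips vs baseline", baseline.sum())  # 1.0 vs 2.0
```

Here flipping row 2 alone ([2, 0]) reproduces the effect of the baseline's two flips ([1, 0] + [1, 0]), so the LP halves the cost. The more rows there are per feature dimension, the more such redundancy exists, which is consistent with the paper's finding that savings grow when the dataset size dwarfs the reward model's feature dimension.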
🛡️ Threat Analysis
The core contribution is a label-flipping attack on preference training data: the attacker corrupts the RLHF/DPO dataset by flipping preference labels to steer the resulting LLM policy toward a target, a canonical data poisoning attack.