attack 2026

When Can You Poison Rewards? A Tight Characterization of Reward Poisoning in Linear MDPs

Jose Efraim Aguilar Escamilla 1, Haoyang Hong 1, Jiawei Li 2, Haoyu Zhao 3, Xuezhou Zhang 4, Sanghyun Hong 1, Huazheng Wang 1

0 citations

Published on arXiv

2604.10062

Model Skewing

OWASP ML Top 10 — ML08

Key Finding

Provides a tight theoretical characterization that separates RL instances that can be attacked within a bounded budget from those that are intrinsically robust to reward poisoning


We study reward poisoning attacks in reinforcement learning (RL), where an adversary manipulates rewards within a constrained budget to force the target RL agent to adopt a policy that aligns with the attacker's objectives. Prior work on reward poisoning has mainly focused on sufficient conditions for designing a successful attacker, while only a few studies have discussed the infeasibility of targeted attacks. This paper provides the first precise necessary-and-sufficient characterization of the attackability of a linear MDP under reward poisoning attacks. Our characterization draws a bright line between the vulnerable RL instances and the intrinsically robust ones, which cannot be attacked without large costs even when running vanilla non-robust RL algorithms. Our theory extends beyond linear MDPs: by approximating deep RL environments as linear MDPs, we show that our theoretical framework effectively distinguishes attackable environments from robust ones and efficiently attacks the vulnerable ones, demonstrating both the theoretical and practical significance of our characterization.


Key Contributions

  • First necessary and sufficient characterization of reward poisoning attackability in linear MDPs
  • Clear boundary between vulnerable RL instances and intrinsically robust ones
  • Extension to deep RL environments via linear MDP approximation

🛡️ Threat Analysis

Model Skewing

Reward poisoning in RL is a form of model skewing: the adversary manipulates the learning signal (rewards) over time to induce specific policy behavior. This is the paradigmatic ML08 attack, exploiting the feedback loop between environment signals and learned behavior to steer the model toward attacker objectives.
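To make the mechanism concrete, the following is a minimal sketch of reward poisoning on a toy two-state tabular MDP (a tabular MDP is a linear MDP with one-hot features). Everything here is an illustrative assumption and not the paper's attack or its attackability characterization: the environment, the helper names `greedy_policy` and `poison`, and the naive strategy of subtracting a fixed per-step amount from every non-target action are all hypothetical.

```python
import numpy as np

def greedy_policy(P, R, gamma=0.9, iters=500):
    """Run value iteration and return the greedy action for each state.

    P has shape (S, A, S) with P[s, a, s'] the transition probability;
    R has shape (S, A) with R[s, a] the (possibly poisoned) reward.
    """
    Q = np.zeros_like(R)
    for _ in range(iters):
        V = Q.max(axis=1)          # state values under the current Q
        Q = R + gamma * (P @ V)    # Bellman optimality backup, shape (S, A)
    return Q.argmax(axis=1)

def poison(R, target, budget):
    """Hypothetical naive attack: lower every non-target reward by `budget`."""
    R_poisoned = R.copy()
    non_target = np.ones_like(R, dtype=bool)
    non_target[np.arange(R.shape[0]), target] = False
    R_poisoned[non_target] -= budget
    return R_poisoned

# Toy environment (assumed): action 0 stays in place, action 1 switches state.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0

# True rewards: staying pays 1, switching pays 0.
R = np.array([[1.0, 0.0],
              [1.0, 0.0]])

# The attacker wants the agent to always switch.
target = np.array([1, 1])

clean = greedy_policy(P, R)
weak = greedy_policy(P, poison(R, target, budget=0.5))
strong = greedy_policy(P, poison(R, target, budget=1.5))

print("clean policy:          ", clean)
print("policy with budget 0.5:", weak)
print("policy with budget 1.5:", strong)
```

In this toy instance the smaller perturbation leaves the greedy policy unchanged, while the larger one flips it to the attacker's target. That is the kind of budget-dependent behavior the paper characterizes, though the threshold here follows only from this hand-built example.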


Details

Domains
reinforcement-learning
Model Types
rl
Threat Tags
training_time, targeted
Applications
reinforcement learning