
Reward-Preserving Attacks For Robust Reinforcement Learning

Lucas Schott 1,2, Elies Gherbi 1, Hatem Hajri 3, Sylvain Lamprier 4,2

0 citations · 26 references · arXiv

Published on arXiv: 2601.07118

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Adaptive reward-preserving adversarial training with intermediate α outperforms fixed-radius and uniformly sampled-radius adversarial training in robustness across perturbation magnitudes while maintaining nominal policy performance.

Reward-Preserving Attacks (α-reward-preserving adversarial training)

Novel technique introduced


Adversarial training in reinforcement learning (RL) is challenging because perturbations cascade through trajectories and compound over time, making fixed-strength attacks either overly destructive or too conservative. We propose reward-preserving attacks, which adapt adversarial strength so that an $α$ fraction of the nominal-to-worst-case return gap remains achievable at each state. In deep RL, perturbation magnitudes $η$ are selected dynamically, using a learned critic $Q((s,a),η)$ that estimates the expected return of $α$-reward-preserving rollouts. For intermediate values of $α$, this adaptive training yields policies that are robust across a wide range of perturbation magnitudes while preserving nominal performance, outperforming fixed-radius and uniformly sampled-radius adversarial training.


Key Contributions

  • Reward-preserving attack formulation that adapts perturbation strength so an α fraction of the nominal-to-worst-case return gap remains achievable, avoiding task infeasibility
  • Learned critic Q((s,a),η) that dynamically selects perturbation magnitudes during deep RL adversarial training
  • Empirical demonstration that intermediate α values yield policies robust across a wide range of perturbation magnitudes while preserving nominal performance, outperforming fixed-radius and uniformly sampled-radius baselines
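The selection rule behind these contributions can be sketched in a few lines: pick the strongest perturbation magnitude η whose critic-predicted return still preserves an α fraction of the nominal-to-worst-case return gap. The sketch below is a minimal illustration of that idea, not the paper's exact training procedure; the function and variable names (`select_perturbation_magnitude`, `eta_grid`, the toy linear critic) are assumptions for illustration.

```python
import numpy as np

def select_perturbation_magnitude(critic, s, a, alpha, eta_grid,
                                  r_nominal, r_worst):
    """Return the largest eta in eta_grid whose predicted return
    Q((s, a), eta) stays above the alpha-reward-preserving threshold.

    Threshold: an alpha fraction of the nominal-to-worst-case
    return gap must remain achievable (hypothetical helper, a
    sketch of the paper's idea rather than its implementation).
    """
    threshold = r_worst + alpha * (r_nominal - r_worst)
    best_eta = 0.0  # no perturbation is always reward-preserving
    for eta in sorted(eta_grid):
        if critic(s, a, eta) >= threshold:
            best_eta = eta  # strongest attack that is still preserving
    return best_eta

# Toy critic (assumption): predicted return decays linearly with eta.
toy_critic = lambda s, a, eta: 100.0 - 200.0 * eta
etas = np.linspace(0.0, 0.5, 11)
eta_star = select_perturbation_magnitude(
    toy_critic, s=None, a=None, alpha=0.5, eta_grid=etas,
    r_nominal=100.0, r_worst=0.0,
)
# With alpha = 0.5 the threshold is 50, so the largest admissible
# magnitude on this grid is eta_star = 0.25.
```

At α = 1 the rule collapses to no attack (the full nominal return must remain achievable); at α = 0 it permits the strongest perturbation on the grid, matching the paper's observation that intermediate α values balance robustness and nominal performance.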

🛡️ Threat Analysis

Input Manipulation Attack

Proposes reward-preserving adversarial perturbations on RL state observations/transitions, and uses them in adversarial training to produce robust policies — core adversarial example attack-and-defense methodology applied to reinforcement learning.


Details

Domains
reinforcement-learning
Model Types
rl
Threat Tags
white_box, training_time, inference_time
Datasets
GridWorld, MuJoCo
Applications
reinforcement learning, continuous control