Reward-Preserving Attacks For Robust Reinforcement Learning
Lucas Schott, Elies Gherbi, Hatem Hajri, Sylvain Lamprier
Published on arXiv (2601.07118)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
Adaptive reward-preserving adversarial training with intermediate α achieves greater robustness across perturbation magnitudes than fixed-radius and uniformly sampled-radius adversarial training, while maintaining nominal policy performance.
Reward-Preserving Attacks (α-reward-preserving adversarial training)
Novel technique introduced
Adversarial training in reinforcement learning (RL) is challenging because perturbations cascade through trajectories and compound over time, making fixed-strength attacks either overly destructive or too conservative. We propose reward-preserving attacks, which adapt adversarial strength so that an $\alpha$ fraction of the nominal-to-worst-case return gap remains achievable at each state. In deep RL, perturbation magnitudes $\eta$ are selected dynamically, using a learned critic $Q((s,a),\eta)$ that estimates the expected return of $\alpha$-reward-preserving rollouts. For intermediate values of $\alpha$, this adaptive training yields policies that are robust across a wide range of perturbation magnitudes while preserving nominal performance, outperforming fixed-radius and uniformly sampled-radius adversarial training.
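The magnitude-selection rule can be sketched concretely. The snippet below is a minimal illustration, not the paper's implementation: it assumes a callable critic `q(s, a, eta)` standing in for the learned $Q((s,a),\eta)$, a finite grid of candidate magnitudes, and known nominal and worst-case return estimates; all of these names and the grid search are assumptions for illustration.

```python
def select_perturbation_magnitude(q, s, a, etas, r_nominal, r_worst, alpha):
    """Pick the largest perturbation magnitude eta such that an alpha
    fraction of the nominal-to-worst-case return gap stays achievable.

    q(s, a, eta) is a hypothetical critic estimating the return of
    alpha-reward-preserving rollouts under magnitude-eta attacks.
    """
    # Return threshold: keep at least an alpha fraction of the gap
    # between worst-case and nominal return achievable.
    threshold = r_worst + alpha * (r_nominal - r_worst)
    # Scan candidate magnitudes from strongest to weakest and take the
    # first one whose predicted return still clears the threshold.
    for eta in sorted(etas, reverse=True):
        if q(s, a, eta) >= threshold:
            return eta
    return 0.0  # no positive magnitude keeps the target return achievable


# Toy usage with a critic that decays linearly in eta (illustration only):
toy_q = lambda s, a, eta: 10.0 - 8.0 * eta
eta = select_perturbation_magnitude(
    toy_q, None, None, [0.0, 0.25, 0.5, 0.75, 1.0],
    r_nominal=10.0, r_worst=2.0, alpha=0.5,
)
# threshold = 2 + 0.5 * (10 - 2) = 6, so the largest admissible eta is 0.5
```

In this toy setting, larger α demands that more of the return gap stay achievable, which forces weaker perturbations; smaller α permits stronger attacks, matching the paper's trade-off between nominal performance and robustness.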
Key Contributions
- Reward-preserving attack formulation that adapts perturbation strength so an α fraction of the nominal-to-worst-case return gap remains achievable, avoiding task infeasibility
- Learned critic Q((s,a),η) that dynamically selects perturbation magnitudes during deep RL adversarial training
- Empirical demonstration that intermediate α values yield policies robust across a wide range of perturbation magnitudes while preserving nominal performance, outperforming fixed-radius and uniformly sampled-radius baselines
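How these pieces fit into a training step can be sketched as follows. This is a hedged sketch under stated assumptions, not the paper's algorithm: the random L-infinity perturbation stands in for whatever attack model the paper uses, and `env_step`, `policy`, and `critic_update` are hypothetical callables introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_observation(obs, eta):
    """Apply a random L-infinity perturbation of magnitude eta
    (a stand-in for an adversarially chosen perturbation)."""
    return obs + rng.uniform(-eta, eta, size=obs.shape)

def adversarial_training_step(env_step, policy, critic_update, obs, eta):
    """One sketched step of reward-preserving adversarial training:
    the agent acts on a perturbed observation, and the critic that
    selects future magnitudes is updated with the (s, a, eta) triple
    and the observed reward."""
    perturbed = perturb_observation(obs, eta)
    action = policy(perturbed)                 # agent sees attacked state
    next_obs, reward = env_step(action)        # environment transition
    critic_update(obs, action, eta, reward)    # refine Q((s,a), eta)
    return next_obs, reward
```

In a full loop, `eta` would itself be chosen each step from the learned critic so the α-fraction return target stays achievable, closing the adaptation loop described in the abstract.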
🛡️ Threat Analysis
Proposes reward-preserving adversarial perturbations on RL state observations/transitions and uses them in adversarial training to produce robust policies: core adversarial-example attack-and-defense methodology applied to reinforcement learning.