
Reward-Preserving Attacks For Robust Reinforcement Learning

Lucas Schott 1,2, Elies Gherbi 1, Hatem Hajri 3, Sylvain Lamprier 4,2

0 citations · 26 references · arXiv

Published on arXiv: 2601.07118

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Adaptive reward-preserving adversarial training with intermediate α outperforms fixed-radius and uniformly sampled-radius adversarial training in robustness across perturbation magnitudes while maintaining nominal policy performance.

Reward-Preserving Attacks (α-reward-preserving adversarial training)

Novel technique introduced


Adversarial training in reinforcement learning (RL) is challenging because perturbations cascade through trajectories and compound over time, making fixed-strength attacks either overly destructive or too conservative. We propose reward-preserving attacks, which adapt adversarial strength so that an $α$ fraction of the nominal-to-worst-case return gap remains achievable at each state. In deep RL, perturbation magnitudes $η$ are selected dynamically, using a learned critic $Q((s,a),η)$ that estimates the expected return of $α$-reward-preserving rollouts. For intermediate values of $α$, this adaptive training yields policies that are robust across a wide range of perturbation magnitudes while preserving nominal performance, outperforming fixed-radius and uniformly sampled-radius adversarial training.


Key Contributions

  • Reward-preserving attack formulation that adapts perturbation strength so an α fraction of the nominal-to-worst-case return gap remains achievable, avoiding task infeasibility
  • Learned critic Q((s,a),η) that dynamically selects perturbation magnitudes during deep RL adversarial training
  • Empirical demonstration that intermediate α values yield policies robust across a wide range of perturbation magnitudes while preserving nominal performance, outperforming fixed-radius and uniformly sampled-radius baselines
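The selection rule behind these contributions can be sketched in a few lines: pick the strongest perturbation magnitude η whose critic-predicted return still preserves an α fraction of the nominal-to-worst-case return gap. The sketch below is a minimal illustration of that idea, not the paper's exact training procedure; the function and variable names (`select_perturbation_magnitude`, `eta_grid`, the toy linear critic) are assumptions for illustration.

```python
import numpy as np

def select_perturbation_magnitude(critic, s, a, alpha, eta_grid,
                                  r_nominal, r_worst):
    """Return the largest eta in eta_grid whose predicted return
    Q((s, a), eta) stays above the alpha-reward-preserving threshold.

    Threshold: an alpha fraction of the nominal-to-worst-case
    return gap must remain achievable (hypothetical helper, a
    sketch of the paper's idea rather than its implementation).
    """
    threshold = r_worst + alpha * (r_nominal - r_worst)
    best_eta = 0.0  # no perturbation is always reward-preserving
    for eta in sorted(eta_grid):
        if critic(s, a, eta) >= threshold:
            best_eta = eta  # strongest attack that is still preserving
    return best_eta

# Toy critic (assumption): predicted return decays linearly with eta.
toy_critic = lambda s, a, eta: 100.0 - 200.0 * eta
etas = np.linspace(0.0, 0.5, 11)
eta_star = select_perturbation_magnitude(
    toy_critic, s=None, a=None, alpha=0.5, eta_grid=etas,
    r_nominal=100.0, r_worst=0.0,
)
# With alpha = 0.5 the threshold is 50, so the largest admissible
# magnitude on this grid is eta_star = 0.25.
```

At α = 1 the rule collapses to no attack (the full nominal return must remain achievable); at α = 0 it permits the strongest perturbation on the grid, matching the paper's observation that intermediate α values balance robustness and nominal performance.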

🛡️ Threat Analysis

Input Manipulation Attack

Proposes reward-preserving adversarial perturbations on RL state observations/transitions, and uses them in adversarial training to produce robust policies — core adversarial example attack-and-defense methodology applied to reinforcement learning.


Details

Domains
reinforcement-learning
Model Types
rl
Threat Tags
white_box, training_time, inference_time
Datasets
GridWorld, MuJoCo
Applications
reinforcement learning, continuous control