On the Tension Between Optimality and Adversarial Robustness in Policy Optimization
Haoran Li 1, Jiayu Lv 1, Congying Han 1, Zicheng Zhang 2, Anqi Li 3, Yan Liu 3, Tiande Guo 1, Nan Jiang 4
Published on arXiv
arXiv:2512.01228
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
BARPO consistently outperforms vanilla ARPO across benchmarks, demonstrating that modulating adversary strength in a bilevel formulation can reconcile the robustness-optimality tradeoff in practice.
BARPO
Novel technique introduced
Optimality and adversarial robustness in deep reinforcement learning have long been regarded as conflicting goals. Nonetheless, recent theoretical insights presented in CAR suggest a potential alignment, raising the important question of how to realize it in practice. This paper first identifies a key gap between theory and practice by comparing standard policy optimization (SPO) and adversarially robust policy optimization (ARPO). Although the two are theoretically consistent, a fundamental tension between robustness and optimality arises in practical policy gradient methods: SPO tends to converge to vulnerable first-order stationary policies (FOSPs) with strong natural performance, whereas ARPO typically favors more robust FOSPs at the expense of reduced returns. We attribute this tradeoff to the reshaping effect of the strongest adversary in ARPO, which significantly complicates the global landscape by inducing deceptive sticky FOSPs: robustness improves, but the landscape becomes harder to navigate. To alleviate this, we develop BARPO, a bilevel framework that unifies SPO and ARPO by modulating adversary strength, thereby keeping the landscape navigable while preserving global optima. Extensive empirical results demonstrate that BARPO consistently outperforms vanilla ARPO, providing a practical approach to reconciling theoretical and empirical performance.
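The core idea of modulating adversary strength can be illustrated with a toy sketch. This is not the paper's implementation: the linear policy, quadratic reward, and grid-search adversary below are all illustrative assumptions. The point is only that an observation adversary with budget `eps` interpolates between the two objectives: `eps = 0` recovers the SPO objective, while a large `eps` approximates the strongest-adversary ARPO objective.

```python
import numpy as np

def policy(obs, theta):
    """Toy deterministic linear policy: action = theta * observation."""
    return theta * obs

def reward(state, action):
    """Toy reward, maximal when the action matches the true state."""
    return -(action - state) ** 2

def worst_case_reward(state, theta, eps, n_grid=101):
    """Adversary perturbs the observation within an eps-ball
    (grid-searched here for simplicity) to minimize the reward.
    The true state is untouched; only the observed input changes."""
    deltas = np.linspace(-eps, eps, n_grid)
    return min(reward(state, policy(state + d, theta)) for d in deltas)

# eps = 0 is the SPO objective; growing eps moves toward the
# strongest-adversary ARPO objective. A BARPO-style scheme would
# modulate eps during training rather than fix it at the maximum.
state, theta = 1.0, 1.0
for eps in (0.0, 0.1, 0.5):
    print(eps, worst_case_reward(state, theta, eps))
```

Note the monotone structure this exposes: the worst-case return can only decrease as the adversary's budget grows, which is why training against the strongest adversary from the start reshapes (and complicates) the objective landscape.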
Key Contributions
- Identifies a fundamental theory-practice gap: although SPO and ARPO share theoretical consistency, practical policy gradient methods exhibit a tradeoff where ARPO gains robustness at the cost of reduced returns
- Attributes this tradeoff to the "reshaping effect of the strongest adversary" in ARPO, which induces deceptive sticky first-order stationary policies that make the optimization landscape harder to navigate
- Proposes BARPO, a bilevel framework that unifies SPO and ARPO by modulating adversary strength to preserve navigability toward global optima while maintaining adversarial robustness
🛡️ Threat Analysis
The paper is fundamentally about defending RL policies against adversarial perturbations of state observations at inference time. ARPO and BARPO are defenses against an adversary that perturbs inputs to cause policy failure, directly fitting the adversarial robustness defense category.
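A minimal sketch of this threat model, under illustrative assumptions (a toy two-action linear policy and a random-search attacker; the paper itself does not specify this attack): the adversary perturbs only the observation the agent sees, within an epsilon-ball, hoping to flip the chosen action while the environment state stays unchanged.

```python
import numpy as np

def act(obs, w):
    """Toy policy: linear scores per action, greedy argmax."""
    return int(np.argmax(w @ obs))

def observation_attack(obs, w, eps, n_trials=200, seed=0):
    """Inference-time input manipulation: random-search the eps-ball
    around the clean observation for a perturbation that changes the
    agent's action. Returns the adversarial observation, or the clean
    one if no flip is found within the trial budget."""
    rng = np.random.default_rng(seed)
    clean_action = act(obs, w)
    for _ in range(n_trials):
        delta = rng.uniform(-eps, eps, size=obs.shape)
        if act(obs + delta, w) != clean_action:
            return obs + delta
    return obs

w = np.array([[1.0, 0.0],
              [0.0, 1.0]])          # identity scores: action = larger coord
obs = np.array([0.6, 0.5])          # clean observation, margin of 0.1
adv = observation_attack(obs, w, eps=0.2)
print(act(obs, w), act(adv, w))
```

Defenses in the ARPO/BARPO family train the policy so that no perturbation within the budget changes its behavior enough to degrade return, which is exactly the adversarial-robustness defense category noted above.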