Rethinking Adversarial Attacks in Reinforcement Learning from Policy Distribution Perspective
Tianyang Duan 1, Zongyuan Zhang 1, Zheng Lin 1, Yue Gao 2, Ling Xiong 3, Yong Cui 4, Hongbin Liang 5, Xianhao Chen 1, Heming Cui 1, Dong Huang 1
Published on arXiv
2501.03562
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
DAPGD achieves an average 22.03% higher reward drop compared to the best baseline across three robot navigation tasks by attacking the policy distribution rather than individual sampled actions.
DAPGD (Distribution-Aware Projected Gradient Descent)
Novel technique introduced
Deep Reinforcement Learning (DRL) suffers from uncertainties and inaccuracies in the observation signal in real-world applications. Adversarial attacks are an effective method for evaluating the robustness of DRL agents. However, existing attack methods that target individual sampled actions have limited impact on the overall policy distribution, particularly in continuous action spaces. To address these limitations, we propose the Distribution-Aware Projected Gradient Descent attack (DAPGD). DAPGD uses distribution similarity as the gradient perturbation input to attack the policy network, leveraging the entire policy distribution rather than relying on individual samples. We use the Bhattacharyya distance in DAPGD to measure policy similarity, enabling sensitive detection of subtle but critical differences between probability distributions. Our experimental results demonstrate that DAPGD achieves SOTA results compared to the baselines in three robot navigation tasks, achieving an average 22.03% higher reward drop than the best baseline.
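The Bhattacharyya distance the abstract relies on has a closed form for Gaussian policies, which are the usual parameterization in continuous-control DRL. A minimal sketch in pure Python (the diagonal-Gaussian assumption is ours; the paper's exact policy parameterization is not stated here):

```python
import math

def bhattacharyya_gaussian(mu1, sigma1, mu2, sigma2):
    """Closed-form Bhattacharyya distance between two univariate Gaussians:

    D_B = (mu1 - mu2)^2 / (4 * (s1^2 + s2^2))
        + 0.5 * ln((s1^2 + s2^2) / (2 * s1 * s2))

    It is 0 only when the distributions are identical, and grows with both
    mean shift and variance mismatch, so it captures subtle distribution
    changes that a single sampled action would miss.
    """
    var_sum = sigma1 ** 2 + sigma2 ** 2
    term_mean = 0.25 * (mu1 - mu2) ** 2 / var_sum
    term_var = 0.5 * math.log(var_sum / (2.0 * sigma1 * sigma2))
    return term_mean + term_var

def bhattacharyya_diag_gaussian(mu1, sigma1, mu2, sigma2):
    """For a diagonal-covariance (independent-dimension) Gaussian policy,
    the distance decomposes into a sum of per-dimension distances."""
    return sum(bhattacharyya_gaussian(m1, s1, m2, s2)
               for m1, s1, m2, s2 in zip(mu1, sigma1, mu2, sigma2))
```

For an attack, this quantity is computed between the policy distribution on the clean observation and on the perturbed observation, and its gradient with respect to the perturbation drives the update.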
Key Contributions
- DAPGD: a distribution-aware adversarial attack for DRL that uses Bhattacharyya distance between policy distributions as the gradient perturbation signal, targeting the full policy distribution rather than individual sampled actions
- Demonstrates that distribution-level perturbations are significantly more effective than action-level attacks in continuous action spaces, achieving 22.03% higher average reward drop than the best baseline
- Validates DAPGD on three Safety-Gym robot navigation tasks (Goal, Button, Push), showing consistent SOTA attack performance
🛡️ Threat Analysis
DAPGD is a white-box gradient-based adversarial perturbation attack on DRL observation inputs at inference time. It extends PGD by using policy distribution similarity (Bhattacharyya distance) as the gradient signal instead of sampled actions, crafting adversarial observations that maximally degrade agent performance.