Vulnerability Analysis of Safe Reinforcement Learning via Inverse Constrained Reinforcement Learning
Jialiang Fan, Shixiong Jiang, Mengyu Liu, Fanxin Kong
Published on arXiv: 2602.16543
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
The proposed ICRL-based framework causes consistent safety constraint violations in Safe RL policies across multiple benchmarks without requiring victim gradient access or ground-truth constraint knowledge.
ICRL-based Adversarial Attack Framework
Novel technique introduced
Safe reinforcement learning (Safe RL) aims to ensure policy performance while satisfying safety constraints. However, most existing Safe RL methods assume benign environments, making them vulnerable to adversarial perturbations commonly encountered in real-world settings. In addition, existing gradient-based adversarial attacks typically require access to the policy's gradient information, which is often impractical in real-world scenarios. To address these challenges, we propose an adversarial attack framework to reveal vulnerabilities of Safe RL policies. Using expert demonstrations and black-box environment interaction, our framework learns a constraint model and a surrogate (learner) policy, enabling gradient-based attack optimization without requiring the victim policy's internal gradients or the ground-truth safety constraints. We further provide theoretical analysis establishing feasibility and deriving perturbation bounds. Experiments on multiple Safe RL benchmarks demonstrate the effectiveness of our approach under limited privileged access.
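The core mechanism described above, crafting observation perturbations with gradients from a learned surrogate rather than the victim, can be illustrated with a toy FGSM-style sketch. Everything here is a stand-in: the linear `W_pi` (surrogate policy) and `w_cost` (learned constraint-cost model) are illustrative placeholders for the ICRL-learned networks, not the paper's implementation.

```python
import numpy as np

# Hypothetical stand-ins for the ICRL-learned models (illustrative only):
# a linear surrogate policy and a linear learned constraint-cost model.
rng = np.random.default_rng(0)
W_pi = rng.normal(size=(2, 4))    # surrogate policy: action = W_pi @ obs
w_cost = rng.normal(size=2)       # learned cost:     cost = w_cost @ action

def craft_perturbation(obs, eps=0.1):
    """FGSM-style L-inf step that increases the learned constraint cost.

    cost(obs) = w_cost @ (W_pi @ obs), so the gradient w.r.t. the
    observation is W_pi.T @ w_cost -- no victim gradients needed.
    """
    grad = W_pi.T @ w_cost
    return obs + eps * np.sign(grad)

obs = np.zeros(4)
adv = craft_perturbation(obs)
clean_cost = w_cost @ (W_pi @ obs)
adv_cost = w_cost @ (W_pi @ adv)
```

Because the step follows the sign of the cost gradient, the learned cost can only increase (by `eps * sum(|grad|)`), while the perturbation stays inside the L-infinity budget.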
Key Contributions
- A practical black-box adversarial attack framework for Safe RL that uses ICRL to learn safety constraints and a surrogate policy from expert demonstrations, enabling gradient-based attacks without privileged access to the victim policy's gradients or ground-truth constraints.
- Theoretical analysis establishing feasibility of the ICRL-based surrogate and deriving perturbation bounds that guide optimal attack strength estimation.
- Empirical evaluation across multiple Safe RL benchmarks and victim algorithms, demonstrating consistent safety constraint violations under limited perturbation budgets.
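To make the "learn safety constraints from expert demonstrations" step concrete, the sketch below uses one common ICRL-flavored recipe: fit a constraint model that separates expert-visited state-actions (feasible) from the extra visits of an unconstrained learner (constraint-violating). Logistic regression and the synthetic Gaussian data are assumptions for illustration; the paper's constraint model is learned differently.

```python
import numpy as np

# Synthetic stand-ins: expert state-actions cluster at -1, the
# unconstrained learner's constraint-violating visits cluster at +1.
rng = np.random.default_rng(1)
expert = rng.normal(loc=-1.0, size=(100, 3))    # label 0: feasible
learner = rng.normal(loc=+1.0, size=(100, 3))   # label 1: violating
X = np.vstack([expert, learner])
y = np.concatenate([np.zeros(100), np.ones(100)])

# Plain gradient descent on the logistic loss stands in for training
# a neural constraint network.
w = np.zeros(3)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - y) / len(y)

def violates(sa):
    """Learned constraint indicator for a state-action feature vector."""
    return 1.0 / (1.0 + np.exp(-sa @ w)) > 0.5

acc = np.mean([violates(x) == yi for x, yi in zip(X, y)])
```

Once such a model exists, the attacker can target its high-cost region when crafting perturbations, without ever seeing the environment's ground-truth constraints.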
🛡️ Threat Analysis
The paper proposes adversarial perturbations to the observation space of deployed Safe RL agents at inference time, designed to cause safety constraint violations. The attack uses a learned surrogate policy to enable gradient-based adversarial input crafting without requiring the victim's internal gradients — a classic input manipulation / evasion attack adapted to the Safe RL setting.
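The deployment picture sketched above is an attacker sitting between the environment and the victim policy, perturbing each observation within a fixed budget before the victim sees it. The wrapper below is a minimal illustration of that threat model; the class name, `craft_fn` hook, and budget value are all hypothetical.

```python
import numpy as np

class ObservationAttacker:
    """Illustrative inference-time man-in-the-middle on observations.

    craft_fn proposes a perturbation (e.g. from surrogate-gradient
    crafting); the wrapper clips it to the L-inf budget eps before
    the perturbed observation is passed to the victim policy.
    """

    def __init__(self, craft_fn, eps):
        self.craft_fn = craft_fn
        self.eps = eps

    def perturb(self, obs):
        delta = self.craft_fn(obs)
        delta = np.clip(delta, -self.eps, self.eps)  # enforce budget
        return obs + delta

# Usage with a dummy crafting function (real crafting would use the
# learned surrogate's gradients).
attacker = ObservationAttacker(craft_fn=lambda o: np.ones_like(o), eps=0.05)
obs = np.zeros(4)
adv_obs = attacker.perturb(obs)
```

Clipping to the budget mirrors the evasion-attack convention that perturbations must stay small enough to pass for sensor noise or benign environment variation.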