Vulnerability Analysis of Safe Reinforcement Learning via Inverse Constrained Reinforcement Learning
Jialiang Fan, Shixiong Jiang, Mengyu Liu, Fanxin Kong
Published on arXiv: 2602.16543
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
The proposed ICRL-based framework causes consistent safety constraint violations in Safe RL policies across multiple benchmarks without requiring victim gradient access or ground-truth constraint knowledge.
ICRL-based Adversarial Attack Framework
Novel technique introduced
Safe reinforcement learning (Safe RL) aims to ensure policy performance while satisfying safety constraints. However, most existing Safe RL methods assume benign environments, making them vulnerable to adversarial perturbations commonly encountered in real-world settings. In addition, existing gradient-based adversarial attacks typically require access to the policy's gradient information, which is often impractical in real-world scenarios. To address these challenges, we propose an adversarial attack framework to reveal vulnerabilities of Safe RL policies. Using expert demonstrations and black-box environment interaction, our framework learns a constraint model and a surrogate (learner) policy, enabling gradient-based attack optimization without requiring the victim policy's internal gradients or the ground-truth safety constraints. We further provide theoretical analysis establishing feasibility and deriving perturbation bounds. Experiments on multiple Safe RL benchmarks demonstrate the effectiveness of our approach under limited privileged access.
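The core mechanism described above, crafting observation perturbations with gradients from a learned surrogate rather than the victim, can be illustrated with a toy FGSM-style sketch. Everything here is a stand-in: the linear `W_pi` (surrogate policy) and `w_cost` (learned constraint-cost model) are illustrative placeholders for the ICRL-learned networks, not the paper's implementation.

```python
import numpy as np

# Hypothetical stand-ins for the ICRL-learned models (illustrative only):
# a linear surrogate policy and a linear learned constraint-cost model.
rng = np.random.default_rng(0)
W_pi = rng.normal(size=(2, 4))    # surrogate policy: action = W_pi @ obs
w_cost = rng.normal(size=2)       # learned cost:     cost = w_cost @ action

def craft_perturbation(obs, eps=0.1):
    """FGSM-style L-inf step that increases the learned constraint cost.

    cost(obs) = w_cost @ (W_pi @ obs), so the gradient w.r.t. the
    observation is W_pi.T @ w_cost -- no victim gradients needed.
    """
    grad = W_pi.T @ w_cost
    return obs + eps * np.sign(grad)

obs = np.zeros(4)
adv = craft_perturbation(obs)
clean_cost = w_cost @ (W_pi @ obs)
adv_cost = w_cost @ (W_pi @ adv)
```

Because the step follows the sign of the cost gradient, the learned cost can only increase (by `eps * sum(|grad|)`), while the perturbation stays inside the L-infinity budget.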
Key Contributions
- A practical black-box adversarial attack framework for Safe RL that uses ICRL to learn safety constraints and a surrogate policy from expert demonstrations, enabling gradient-based attacks without privileged access to the victim policy's gradients or ground-truth constraints.
- Theoretical analysis establishing feasibility of the ICRL-based surrogate and deriving perturbation bounds that guide optimal attack strength estimation.
- Empirical evaluation across multiple Safe RL benchmarks and victim algorithms, demonstrating consistent safety constraint violations under limited perturbation budgets.
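To make the "learn safety constraints from expert demonstrations" step concrete, the sketch below uses one common ICRL-flavored recipe: fit a constraint model that separates expert-visited state-actions (feasible) from the extra visits of an unconstrained learner (constraint-violating). Logistic regression and the synthetic Gaussian data are assumptions for illustration; the paper's constraint model is learned differently.

```python
import numpy as np

# Synthetic stand-ins: expert state-actions cluster at -1, the
# unconstrained learner's constraint-violating visits cluster at +1.
rng = np.random.default_rng(1)
expert = rng.normal(loc=-1.0, size=(100, 3))    # label 0: feasible
learner = rng.normal(loc=+1.0, size=(100, 3))   # label 1: violating
X = np.vstack([expert, learner])
y = np.concatenate([np.zeros(100), np.ones(100)])

# Plain gradient descent on the logistic loss stands in for training
# a neural constraint network.
w = np.zeros(3)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - y) / len(y)

def violates(sa):
    """Learned constraint indicator for a state-action feature vector."""
    return 1.0 / (1.0 + np.exp(-sa @ w)) > 0.5

acc = np.mean([violates(x) == yi for x, yi in zip(X, y)])
```

Once such a model exists, the attacker can target its high-cost region when crafting perturbations, without ever seeing the environment's ground-truth constraints.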
🛡️ Threat Analysis
The paper proposes adversarial perturbations to the observation space of deployed Safe RL agents at inference time, designed to cause safety constraint violations. The attack uses a learned surrogate policy to enable gradient-based adversarial input crafting without requiring the victim's internal gradients — a classic input manipulation / evasion attack adapted to the Safe RL setting.
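The deployment picture sketched above is an attacker sitting between the environment and the victim policy, perturbing each observation within a fixed budget before the victim sees it. The wrapper below is a minimal illustration of that threat model; the class name, `craft_fn` hook, and budget value are all hypothetical.

```python
import numpy as np

class ObservationAttacker:
    """Illustrative inference-time man-in-the-middle on observations.

    craft_fn proposes a perturbation (e.g. from surrogate-gradient
    crafting); the wrapper clips it to the L-inf budget eps before
    the perturbed observation is passed to the victim policy.
    """

    def __init__(self, craft_fn, eps):
        self.craft_fn = craft_fn
        self.eps = eps

    def perturb(self, obs):
        delta = self.craft_fn(obs)
        delta = np.clip(delta, -self.eps, self.eps)  # enforce budget
        return obs + delta

# Usage with a dummy crafting function (real crafting would use the
# learned surrogate's gradients).
attacker = ObservationAttacker(craft_fn=lambda o: np.ones_like(o), eps=0.05)
obs = np.zeros(4)
adv_obs = attacker.perturb(obs)
```

Clipping to the budget mirrors the evasion-attack convention that perturbations must stay small enough to pass for sensor noise or benign environment variation.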