Vulnerability Analysis of Safe Reinforcement Learning via Inverse Constrained Reinforcement Learning

Jialiang Fan 1, Shixiong Jiang 1, Mengyu Liu 2, Fanxin Kong 1

0 citations · 29 references · arXiv (Cornell University)

Published on arXiv

2602.16543

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

The proposed ICRL-based framework causes consistent safety constraint violations in Safe RL policies across multiple benchmarks without requiring victim gradient access or ground-truth constraint knowledge.

ICRL-based Adversarial Attack Framework

Novel technique introduced


Safe reinforcement learning (Safe RL) aims to ensure policy performance while satisfying safety constraints. However, most existing Safe RL methods assume benign environments, making them vulnerable to adversarial perturbations commonly encountered in real-world settings. In addition, existing gradient-based adversarial attacks typically require access to the policy's gradient information, which is often impractical in real-world scenarios. To address these challenges, we propose an adversarial attack framework to reveal vulnerabilities of Safe RL policies. Using expert demonstrations and black-box environment interaction, our framework learns a constraint model and a surrogate (learner) policy, enabling gradient-based attack optimization without requiring the victim policy's internal gradients or the ground-truth safety constraints. We further provide theoretical analysis establishing feasibility and deriving perturbation bounds. Experiments on multiple Safe RL benchmarks demonstrate the effectiveness of our approach under limited privileged access.


Key Contributions

  • A practical black-box adversarial attack framework for Safe RL that uses ICRL to learn safety constraints and a surrogate policy from expert demonstrations, enabling gradient-based attacks without privileged access to the victim policy's gradients or ground-truth constraints.
  • Theoretical analysis establishing feasibility of the ICRL-based surrogate and deriving perturbation bounds that guide optimal attack strength estimation.
  • Empirical evaluation across multiple Safe RL benchmarks demonstrating consistent safety violations under constrained perturbation budgets across different Safe RL algorithms.

🛡️ Threat Analysis

Input Manipulation Attack

The paper proposes adversarial perturbations to the observation space of deployed Safe RL agents at inference time, designed to cause safety constraint violations. The attack uses a learned surrogate policy to enable gradient-based adversarial input crafting without requiring the victim's internal gradients — a classic input manipulation / evasion attack adapted to the Safe RL setting.
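The mechanism above can be illustrated with a toy sketch. This is not the paper's implementation: the linear surrogate policy, the quadratic constraint-cost model, and all variable names (`W`, `A`, `B`, `fgsm_perturb`) are illustrative stand-ins for the ICRL-learned components. The point is only the structure of the attack: differentiate a *learned* constraint cost through a *surrogate* policy to craft a bounded observation perturbation, with no access to the victim's gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim = 4, 2

# Surrogate linear policy a = W @ s (stand-in for the ICRL-learned learner policy).
W = rng.normal(size=(act_dim, obs_dim))

# Learned constraint-cost model: c(s, a) = ||A s + B a||^2 (stand-in for the
# constraint model recovered from expert demonstrations).
A = rng.normal(size=(3, obs_dim))
B = rng.normal(size=(3, act_dim))

def cost(s):
    # Predicted constraint cost of the surrogate's action in state s.
    a = W @ s
    r = A @ s + B @ a
    return float(r @ r)

def grad_cost(s):
    # Analytic gradient of the cost w.r.t. the observation, flowing through
    # the surrogate policy -- the victim policy is never differentiated.
    a = W @ s
    r = A @ s + B @ a
    return 2.0 * (A + B @ W).T @ r

def fgsm_perturb(s, eps=0.1):
    # One FGSM-style step: push the observation toward higher predicted
    # constraint cost, within an L-infinity budget eps.
    return s + eps * np.sign(grad_cost(s))

s = rng.normal(size=obs_dim)
s_adv = fgsm_perturb(s)
```

In this convex toy the perturbed observation provably raises the predicted constraint cost whenever the gradient is nonzero; the real framework instead optimizes perturbations against the learned constraint model over full Safe RL benchmark environments.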


Details

Domains
reinforcement-learning
Model Types
rl
Threat Tags
black_box, inference_time, targeted, digital
Datasets
OmniSafe benchmarks, Safety-Gymnasium environments
Applications
autonomous driving, robotic manipulation, safety-critical reinforcement learning