Diffusion Guided Adversarial State Perturbations in Reinforcement Learning
Xiaolin Sun 1,2, Feidi Liu 2, Zhengming Ding 1, ZiZhan Zheng 1
Published on arXiv (arXiv:2511.07701)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
SHIFT breaks state-of-the-art diffusion-based RL defenses that are immune to existing lp-norm attacks, while generating perturbed states that are visually realistic and history-consistent
SHIFT
Novel technique introduced
Reinforcement learning (RL) systems, while achieving remarkable success across various domains, are vulnerable to adversarial attacks. This is a particular concern in vision-based environments, where minor manipulations of high-dimensional image inputs can easily mislead the agent's behavior. To this end, various defenses have been proposed recently, with state-of-the-art approaches achieving robust performance even under large state perturbations. However, on closer investigation, we found that the effectiveness of current defenses stems from a fundamental weakness of existing $l_p$ norm-constrained attacks, which can barely alter the semantics of the image input even under a relatively large perturbation budget. In this work, we propose SHIFT, a novel policy-agnostic diffusion-based state perturbation attack that goes beyond this limitation. Our attack generates perturbed states that are semantically different from the true states while remaining realistic and history-aligned to avoid detection. Evaluations show that our attack effectively breaks existing defenses, including the most sophisticated ones, significantly outperforming existing attacks while being more perceptually stealthy. The results highlight the vulnerability of RL agents to semantics-aware adversarial perturbations, underscoring the importance of developing more robust policies.
Key Contributions
- Proposes SHIFT, a policy-agnostic diffusion-guided attack that generates semantically meaningful adversarial states beyond lp-norm constraints by combining classifier-free diffusion with classifier guidance
- Introduces history-alignment to keep perturbed states realistic and temporally consistent, evading diffusion-based defenses that detect out-of-distribution inputs
- Demonstrates that SHIFT breaks all evaluated state-of-the-art RL defenses (including diffusion-based ones like DMBP and DP-DQN) while being more perceptually stealthy than existing lp-norm attacks
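The core mechanism in the first contribution — steering a classifier-free-guided diffusion sampler with an additional classifier (attack) gradient — can be illustrated on a toy scalar "state" with analytic scores. Everything below is a hedged sketch under stated assumptions: the Gaussian data distributions, the guidance weights `w` and `s`, and the function names are illustrative, not the paper's implementation.

```python
import numpy as np

# Toy annealed-Langevin sampler over a scalar "state". The data score is
# analytic: unconditional data ~ N(0, 1), history-conditioned data ~ N(2, 1).
# An adversarial "classifier" prefers states near `target`. All constants
# here are illustrative assumptions, not the paper's actual setup.

def score(x, sigma2, mean):
    # Score of N(mean, 1) convolved with sampling noise of variance sigma2.
    return (mean - x) / (1.0 + sigma2)

def shift_style_sample(w=3.0, s=2.0, target=-2.0, seed=0, steps=2000):
    rng = np.random.default_rng(seed)
    x = rng.normal()
    # Anneal the noise level from high to low, as in diffusion sampling.
    for sigma2 in np.linspace(4.0, 0.0, steps):
        s_u = score(x, sigma2, mean=0.0)   # unconditional score
        s_c = score(x, sigma2, mean=2.0)   # history-conditioned score
        g = s_u + w * (s_c - s_u)          # classifier-free guidance mix
        g += s * (target - x)              # classifier (attack) guidance term
        step = 0.01
        x += step * g + np.sqrt(2.0 * step) * rng.normal()
    return x
```

With `s = 0` the sampler lands near the history-conditioned (realistic) region; with `s > 0` the classifier term pulls the sample toward the adversarial target while the conditional score keeps it anchored to plausible states — the trade-off SHIFT exploits.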
🛡️ Threat Analysis
SHIFT crafts adversarial inputs (perturbed image observations) that cause RL agents to make incorrect decisions at inference time. Unlike standard lp-norm attacks, it uses a diffusion model with classifier guidance to generate semantically different but realistic states — a novel evasion attack that breaks existing adversarial defenses.
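The threat model above — an attacker rewriting observations between the environment and the victim policy at inference time — can be sketched as a simple wrapper loop. `guided_diffusion_perturb` is a hypothetical stand-in for SHIFT's sampler, and the toy policy and dynamics are illustrative assumptions.

```python
import numpy as np

def guided_diffusion_perturb(obs, history):
    # Placeholder: a real attack would run history-conditioned guided
    # diffusion to produce a realistic but semantically different state.
    return obs + 1.0  # stand-in perturbation

def victim_policy(obs):
    # Toy policy: choose an action from the sign of the observation mean.
    return 0 if obs.mean() >= 0 else 1

def run_episode(attack=True, T=5):
    history, actions = [], []
    obs = np.zeros(3)  # true environment state (toy)
    for _ in range(T):
        seen = guided_diffusion_perturb(obs, history) if attack else obs
        actions.append(victim_policy(seen))  # agent acts on the seen state
        history.append(seen)                 # perturbed history stays consistent
        obs = obs - 0.5                      # toy environment transition
    return actions
```

Note the attack is policy-agnostic in this framing: the perturbation function never queries `victim_policy`, yet still changes the actions the agent takes.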