survey 2025

Enhancing Security in Deep Reinforcement Learning: A Comprehensive Survey on Adversarial Attacks and Defenses

Wu Yichao 1, Wang Yirui 1, Ding Panpan 1, Wang Hailong 1, Zhu Bingqian 1, Liu Chun 1,2

2 citations · 108 references · arXiv


Published on arXiv: 2510.20314

  • Input Manipulation Attack — OWASP ML Top 10, ML01
  • Model Skewing — OWASP ML Top 10, ML08
  • Model Poisoning — OWASP ML Top 10, ML10

Key Finding

Provides a structured classification framework for DRL adversarial attacks across four attack surfaces and identifies key open challenges in generalization, efficiency, and scalability for robust DRL systems.


With the wide application of deep reinforcement learning (DRL) in complex fields such as autonomous driving, intelligent manufacturing, and smart healthcare, improving its security and robustness in dynamic, changing environments has become a core research issue. Under adversarial attack, DRL agents may suffer severe performance degradation or even make dangerous decisions, so ensuring their stability in security-sensitive scenarios is crucial. This paper first introduces the basic framework of DRL and analyzes the main security challenges it faces in complex and changing environments. It then proposes a classification framework for adversarial attacks based on perturbation type and attack target, and reviews the mainstream adversarial attack methods against DRL in detail, including attacks that perturb the state space, action space, reward function, and model space. To counter these attacks, the paper systematically summarizes current robustness training strategies, including adversarial training, competitive training, robust learning, adversarial detection, and defensive distillation, and discusses the advantages and shortcomings of these methods in improving DRL robustness. Finally, it outlines future research directions for DRL in adversarial environments, emphasizing the need to improve generalization, reduce computational complexity, and enhance scalability and explainability, aiming to provide valuable references and directions for researchers.


Key Contributions

  • Proposes a taxonomy of adversarial attacks on DRL based on perturbation type and attack target (state, action, reward, model space)
  • Systematically reviews defense strategies including adversarial training, competitive training, robust learning, adversarial detection, and defensive distillation
  • Identifies future research directions for DRL security: improving generalization, reducing computational complexity, and enhancing scalability and explainability
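One of the defenses the survey reviews, adversarial training, can be illustrated with a minimal sketch. The setup below is an assumption for illustration (a linear softmax policy updated with a REINFORCE-style step), not the survey's specific algorithm: the observation is first perturbed in the worst-case direction for the chosen action, and the policy-gradient update is then computed on that perturbed state.

```python
import numpy as np

def adversarial_training_step(W, state, action, advantage, eps=0.05, lr=0.01):
    """One REINFORCE-style update on an adversarially perturbed state.

    W: (n_actions, state_dim) weights of a linear softmax policy (illustrative).
    """
    grad_state = W[action]                         # d(logit_a)/d(state) for a linear policy
    adv_state = state - eps * np.sign(grad_state)  # FGSM-style worst-case perturbation
    logits = W @ adv_state
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Gradient of advantage * log pi(action | adv_state) w.r.t. W:
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    grad_W = np.outer(one_hot - probs, adv_state) * advantage
    return W + lr * grad_W                         # gradient ascent on the robust objective

# Hypothetical usage on random data:
rng = np.random.default_rng(0)
W0 = rng.normal(size=(4, 8))   # 4 actions, 8-dim state (illustrative sizes)
s = rng.normal(size=8)
W1 = adversarial_training_step(W0, s, action=2, advantage=1.0)
```

Training on perturbed rather than clean states is what distinguishes this loop from standard policy-gradient updates; the trade-off, as the survey notes, is added computational cost per step.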

🛡️ Threat Analysis

Input Manipulation Attack

Covers adversarial perturbation attacks on DRL state/observation space and action space — evasion attacks at inference time causing degraded or dangerous decisions, along with defenses like adversarial training and detection.
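A common instance of this attack class is an FGSM-style perturbation of the agent's observation. The sketch below assumes a linear policy (logits = W @ state) purely for illustration; it is not the survey's specific attack, but it shows the mechanic: step the state against the gradient of the currently preferred action's logit, within an L-infinity budget.

```python
import numpy as np

def fgsm_state_attack(W, state, eps=0.1):
    """Untargeted evasion sketch: return an adversarial observation that
    lowers the logit of the policy's currently preferred action."""
    logits = W @ state
    a_star = int(np.argmax(logits))   # agent's preferred action on the clean state
    grad = W[a_star]                  # d(logit_{a*})/d(state) for a linear policy
    return state - eps * np.sign(grad)

# Hypothetical usage on random data:
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))           # 4 actions, 8-dim state (illustrative sizes)
state = rng.normal(size=8)
adv = fgsm_state_attack(W, state)     # |adv - state|_inf <= 0.1
```

For a deep policy the analytic gradient `W[a_star]` would be replaced by a backpropagated gradient through the network, but the sign-step structure is the same.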

Model Skewing

Surveys reward function manipulation attacks — a core DRL threat where adversaries skew model behavior through feedback loop exploitation and temporal reward poisoning.
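The temporal structure of this threat can be sketched as a budgeted reward-poisoning routine. The function below is a hypothetical illustration (names and budget scheme are assumptions, not the survey's specific formulation): the attacker shifts the reward of selected timesteps by a fixed delta, but only up to a poisoning budget, leaving all other feedback untouched.

```python
import numpy as np

def poison_rewards(rewards, target_mask, delta, budget):
    """Temporal reward-poisoning sketch: shift the reward of up to `budget`
    targeted timesteps by `delta`; all other timesteps are unchanged."""
    poisoned = np.asarray(rewards, dtype=float).copy()
    idx = np.flatnonzero(target_mask)[:budget]   # respect the attacker's budget
    poisoned[idx] += delta
    return poisoned

# Hypothetical usage: penalize the first two targeted timesteps.
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
mask = np.array([True, False, True, True, False])
out = poison_rewards(rewards, mask, delta=-2.0, budget=2)
```

Keeping most rewards clean is what makes such attacks hard to detect: the agent's feedback loop is skewed by a small, targeted fraction of the signal.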

Model Poisoning

Covers model space attacks on DRL, including backdoor/trojan insertion methods and associated defenses such as Neural Cleanse and pruning-based techniques.
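The backdoor-insertion side of this threat can be sketched as training-data poisoning. The routine below is a hypothetical illustration (trigger scheme and names are assumptions, not a method from the survey): a fixed trigger pattern is stamped into a small number of training observations, and their action labels are rebound to the attacker's target action.

```python
import numpy as np

def implant_backdoor(states, actions, trigger_dims, trigger_val,
                     target_action, n_poison, rng):
    """Backdoor-poisoning sketch: return copies of (states, actions) in which
    `n_poison` randomly chosen samples carry the trigger and the target action."""
    states = np.array(states, dtype=float, copy=True)
    actions = np.array(actions, copy=True)
    idx = rng.choice(len(states), size=n_poison, replace=False)
    states[np.ix_(idx, trigger_dims)] = trigger_val   # embed the trigger pattern
    actions[idx] = target_action                      # bind trigger to target action
    return states, actions, idx

# Hypothetical usage on random data:
rng = np.random.default_rng(1)
S = rng.normal(size=(10, 6))                  # 10 samples, 6-dim observations
A = rng.integers(0, 4, size=10)               # 4 possible actions
Sp, Ap, idx = implant_backdoor(S, A, trigger_dims=[0, 1], trigger_val=5.0,
                               target_action=3, n_poison=3, rng=rng)
```

Defenses such as Neural Cleanse attempt to reverse-engineer exactly this kind of small, fixed trigger, while pruning-based techniques remove the neurons the trigger activates.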


Details

Domains
reinforcement-learning
Model Types
rl
Threat Tags
white_box · black_box · grey_box · training_time · inference_time · targeted · untargeted · digital
Applications
autonomous driving · intelligent manufacturing · smart healthcare · game playing