Provably Invincible Adversarial Attacks on Reinforcement Learning Systems: A Rate-Distortion Information-Theoretic Approach

Ziqing Lu 1, Lifeng Lai 2, Weiyu Xu 1

0 citations · 28 references · arXiv

Published on arXiv · 2510.13792

Data Poisoning Attack

OWASP ML Top 10 — ML02

Key Finding

The proposed rate-distortion attack provably forces the victim RL agent to incur a non-zero reward-regret lower bound regardless of the defense used, unlike deterministic attacks, which the victim can reverse once it becomes aware of them.

Rate-Distortion Information-Theoretic Adversarial Attack

Novel technique introduced


Reinforcement learning (RL) for the Markov Decision Process (MDP) has emerged in many security-related applications, such as autonomous driving, financial decisions, and drone/robot algorithms. In order to improve the robustness/defense of RL systems against adversaries, studying various adversarial attacks on RL systems is very important. Most previous work considered deterministic adversarial attack strategies in MDP, which the recipient (victim) agent can defeat by reversing the deterministic attacks. In this paper, we propose a provably "invincible" or "uncounterable" type of adversarial attack on RL. The attackers apply a rate-distortion information-theoretic approach to randomly change agents' observations of the transition kernel (or other properties) so that the agent gains zero or very limited information about the ground-truth kernel (or other properties) during the training. We derive an information-theoretic lower bound on the recipient agent's reward regret and show the impact of rate-distortion attacks on state-of-the-art model-based and model-free algorithms. We also extend this notion of an information-theoretic approach to other types of adversarial attack, such as state observation attacks.


Key Contributions

  • Proposes a provably 'invincible' adversarial attack on RL systems using rate-distortion information theory to randomize the transition kernel observed by the victim agent during training
  • Derives an information-theoretic lower bound on the victim agent's reward regret that holds regardless of the defense strategy adopted
  • Extends the rate-distortion attack framework to other RL attack types including state observation attacks, action attacks, and reward attacks
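The core idea of randomizing the kernel the victim observes can be sketched in a few lines. The snippet below is an illustrative simplification, not the paper's construction: it mixes each row of the true transition kernel with an independently drawn random kernel (the paper instead chooses the randomization to be rate-distortion optimal, so the names `poison_kernel` and the mixing weight `eps` are assumptions of this sketch).

```python
import numpy as np

def poison_kernel(true_kernel: np.ndarray, eps: float,
                  rng: np.random.Generator) -> np.ndarray:
    """Return one randomized observation of a transition kernel.

    `true_kernel` has shape (S, A, S), where true_kernel[s, a] is the
    distribution P(s' | s, a).  Each (s, a) row is mixed with a freshly
    drawn random stochastic row, so many distinct ground-truth kernels
    induce similar observation distributions and the victim's samples
    carry only limited information about the truth.
    """
    S, A, _ = true_kernel.shape
    # Random stochastic "distortion" rows, one per (state, action) pair.
    noise = rng.dirichlet(np.ones(S), size=(S, A))
    return (1.0 - eps) * true_kernel + eps * noise
```

With `eps = 0` the victim sees the true kernel; as `eps` grows, the observed kernel reveals less about the ground truth, which is the lever the regret lower bound quantifies.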

🛡️ Threat Analysis

Data Poisoning Attack

The attack corrupts the RL agent's training experience by randomizing the observed state-transition kernels: the agent learns from poisoned environmental data, making this a training-time data poisoning attack. The paper explicitly frames it as a 'poisoning adversarial attack' that manipulates what the victim agent learns during training. The information-theoretic guarantee ensures the poisoned training data prevents recovery of the ground-truth transition kernel regardless of the defense employed.
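At training time, this poisoning can be pictured as the attacker intercepting each transition before the victim records it. The sketch below is a hedged stand-in for the paper's rate-distortion construction (the function name `poison_transition` and the substitution probability `eps` are assumptions of this example): with probability `eps` the reported next state is replaced by a uniformly random one, so the victim's empirical kernel converges to a noisy mixture rather than the ground truth.

```python
import numpy as np

def poison_transition(next_state: int, num_states: int, eps: float,
                      rng: np.random.Generator) -> int:
    """Poison a single observed transition at training time.

    With probability `eps`, report a uniformly random next state instead
    of the true one.  Over many samples, the victim's estimated kernel
    tends to (1 - eps) * P_true + eps * Uniform, not P_true -- an
    illustrative, non-optimal analogue of the paper's randomized attack.
    """
    if rng.random() < eps:
        return int(rng.integers(num_states))  # attacker's substitution
    return next_state
```

Because the substitution is random rather than deterministic, a victim aware of the attack cannot simply invert it, which is the property the 'invincible' claim formalizes.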


Details

Domains
reinforcement-learning
Model Types
rl
Threat Tags
training_time · white_box · targeted
Applications
autonomous driving · financial decision-making · drone/robot control · reinforcement learning systems