defense 2025

Corruption-Tolerant Asynchronous Q-Learning with Near-Optimal Rates

Sreejeet Maity, Aritra Mitra


Published on arXiv: 2509.08933

Data Poisoning Attack

OWASP ML Top 10 — ML02

Key Finding

Robust Async-Q achieves finite-time convergence matching the non-adversarial case up to an unavoidable additive term proportional to the corruption fraction ε, proven tight by a matching lower bound.

Robust Async-Q

Novel technique introduced


We consider the problem of learning the optimal policy in a discounted, infinite-horizon reinforcement learning (RL) setting where the reward signal is subject to adversarial corruption. Such corruption, which may arise from extreme noise, sensor faults, or malicious attacks, can severely degrade the performance of classical algorithms such as Q-learning. To address this challenge, we propose a new provably robust variant of the Q-learning algorithm that operates effectively even when a fraction of the observed rewards are arbitrarily perturbed by an adversary. Under the asynchronous sampling model with time-correlated data, we establish that despite adversarial corruption, the finite-time convergence rate of our algorithm matches that of existing results for the non-adversarial case, up to an additive term proportional to the fraction of corrupted samples. Moreover, we derive an information-theoretic lower bound revealing that the additive corruption term in our upper bounds is unavoidable. Next, we propose a variant of our algorithm that requires no prior knowledge of the statistics of the true reward distributions. The analysis of this setting is particularly challenging and is enabled by carefully exploiting a refined Azuma-Hoeffding inequality for almost-martingales, a technical tool that might be of independent interest. Collectively, our contributions provide the first finite-time robustness guarantees for asynchronous Q-learning, bridging a significant gap in robust RL.


Key Contributions

  • Robust Async-Q algorithm using trimmed-mean reward estimation with adaptive thresholds that tolerates adversarial reward corruption under the Huber contamination model
  • First finite-time convergence guarantees for asynchronous Q-learning under adversarial corruption, matching non-adversarial rates up to an additive ε-proportional term
  • Information-theoretic lower bound proving the additive corruption term is unavoidable, plus a variant requiring no prior knowledge of reward distribution statistics
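The paper itself does not reproduce pseudocode in this summary, but the core idea of the first contribution can be sketched: instead of plugging the raw (possibly adversarial) reward into the asynchronous Q-learning update, maintain a short history of rewards per state-action pair and use a trimmed-mean estimate. The constants and the `RobustAsyncQ` class below are illustrative assumptions, not the paper's exact algorithm or adaptive-threshold rule.

```python
from collections import defaultdict, deque

# Illustrative sketch only -- NOT the paper's exact Robust Async-Q.
# It shows the core mechanism: replace the raw (possibly corrupted)
# reward with a trimmed-mean estimate over recent observations.

GAMMA = 0.9          # discount factor
ALPHA = 0.1          # step size (the paper uses carefully tuned schedules)
EPS_CORRUPT = 0.1    # assumed bound on the corruption fraction epsilon
WINDOW = 50          # reward history kept per (s, a) pair (an assumption)

def trimmed_mean(values, trim_frac):
    """Drop the smallest and largest trim_frac fractions, average the rest."""
    if not values:
        return 0.0
    k = int(len(values) * trim_frac)
    vals = sorted(values)
    kept = vals[k:len(vals) - k] or vals
    return sum(kept) / len(kept)

class RobustAsyncQ:
    def __init__(self, n_actions):
        self.q = defaultdict(float)                        # Q[(s, a)]
        self.rewards = defaultdict(lambda: deque(maxlen=WINDOW))
        self.n_actions = n_actions

    def update(self, s, a, observed_reward, s_next):
        # Record the raw reward, then use a sanitized estimate in the
        # Bellman update so outliers cannot drag the Q-function.
        self.rewards[(s, a)].append(observed_reward)
        r_hat = trimmed_mean(list(self.rewards[(s, a)]), EPS_CORRUPT)
        target = r_hat + GAMMA * max(
            self.q[(s_next, b)] for b in range(self.n_actions)
        )
        self.q[(s, a)] += ALPHA * (target - self.q[(s, a)])
```

The window and trim fraction here are fixed for simplicity; the paper's variant adapts its rejection thresholds, and a further variant removes the need for prior knowledge of the reward statistics entirely.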

🛡️ Threat Analysis

Data Poisoning Attack

The adversary corrupts a fraction ε of observed rewards (the training signal in RL), which is precisely data poisoning of the RL learning process. The defense — a trimmed-mean estimator with adaptive rejection thresholds — is a data sanitization approach that prevents corrupted samples from degrading the learned Q-function. The paper provides finite-time convergence bounds and an information-theoretic lower bound showing the additive corruption term is unavoidable.
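A toy numerical example (not from the paper) makes the sanitization argument concrete: when an adversary replaces an ε fraction of rewards with large outliers, a plain sample mean is dragged arbitrarily far from the truth, while a trimmed mean that discards the extreme ε fraction on each side stays close to the clean value.

```python
import random
import statistics

# Toy demonstration of trimmed-mean robustness under epsilon-fraction
# corruption. Numbers are illustrative, not taken from the paper.
random.seed(0)
eps = 0.1
clean = [random.gauss(1.0, 0.1) for _ in range(900)]   # true mean ~1.0
poisoned = clean + [1000.0] * 100                      # eps = 100/1000

def trimmed_mean(values, trim_frac):
    """Average after discarding the extreme trim_frac on each side."""
    k = int(len(values) * trim_frac)
    vals = sorted(values)
    return statistics.mean(vals[k:len(vals) - k])

plain = statistics.mean(poisoned)       # dragged far above 1.0
robust = trimmed_mean(poisoned, eps)    # stays near the true mean
```

This mirrors the paper's picture qualitatively: the robust estimator's residual error scales with ε rather than with the magnitude of the corruption, which is why an additive ε-proportional term in the convergence bound is the natural (and, per the lower bound, unavoidable) price.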


Details

Domains
reinforcement-learning
Model Types
rl
Threat Tags
training_time, black_box
Applications
reinforcement learning, q-learning, autonomous driving, robotics