
Efficient Adversarial Attacks on High-dimensional Offline Bandits

Seyed Mohammad Hadi Hosseini , Amir Najafi , Mahdieh Soleymani Baghshah

0 citations · 52 references · arXiv (Cornell University)


Published on arXiv · 2602.01658

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

Targeted weight perturbations achieve near-perfect attack success rates on Hugging Face reward models, with the required perturbation norm provably decreasing as input dimensionality grows — making modern high-dimensional evaluation pipelines especially vulnerable


Bandit algorithms have recently emerged as a powerful tool for evaluating machine learning models, including generative image models and large language models, by efficiently identifying top-performing candidates without exhaustive comparisons. These methods typically rely on a reward model, often distributed with public weights on platforms such as Hugging Face, to provide feedback to the bandit. While online evaluation is expensive and requires repeated trials, offline evaluation with logged data has become an attractive alternative. However, the adversarial robustness of offline bandit evaluation remains largely unexplored, particularly when an attacker perturbs the reward model (rather than the training data) prior to bandit training. In this work, we fill this gap by investigating, both theoretically and empirically, the vulnerability of offline bandit training to adversarial manipulations of the reward model. We introduce a novel threat model in which an attacker exploits offline data in high-dimensional settings to hijack the bandit's behavior. Starting with linear reward functions and extending to nonlinear models such as ReLU neural networks, we study attacks on two Hugging Face evaluators used for generative model assessment: one measuring aesthetic quality and the other assessing compositional alignment. Our results show that even small, imperceptible perturbations to the reward model's weights can drastically alter the bandit's behavior. From a theoretical perspective, we prove a striking high-dimensional effect: as input dimensionality increases, the perturbation norm required for a successful attack decreases, making modern applications such as image evaluation especially vulnerable. Extensive experiments confirm that naive random perturbations are ineffective, whereas carefully targeted perturbations achieve near-perfect attack success rates ...
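The high-dimensional effect described above can be illustrated with a minimal sketch. This is not the paper's construction; it assumes a hypothetical linear reward model r(x) = w·x and two candidate arms with random Gaussian features, and computes the smallest-norm weight perturbation that flips which arm the reward model prefers. Averaged over trials, the required perturbation norm shrinks as the input dimension d grows, consistent with the stated result:

```python
import numpy as np

rng = np.random.default_rng(0)

def min_flip_perturbation(w, x_a, x_b):
    """Smallest L2 perturbation to the linear reward weights w (r(x) = w @ x)
    that makes arm b's reward match or beat arm a's. The minimizer is a
    projection onto the feature difference: delta = (-gap / ||diff||^2) * diff."""
    diff = x_b - x_a
    gap = w @ diff                      # negative while a still beats b
    if gap >= 0:
        return np.zeros_like(w)         # b already wins; nothing to do
    return (-gap / (diff @ diff)) * diff

mean_norms = {}
for d in [16, 256, 4096]:
    vals = []
    for _ in range(200):
        w = rng.standard_normal(d) / np.sqrt(d)        # unit-scale weights
        x_a, x_b = rng.standard_normal(d), rng.standard_normal(d)
        if w @ x_b > w @ x_a:                          # make a the current winner
            x_a, x_b = x_b, x_a
        delta = min_flip_perturbation(w, x_a, x_b)
        # after the perturbation, arm b is ranked at least as high as arm a
        assert (w + delta) @ x_b >= (w + delta) @ x_a - 1e-9
        vals.append(np.linalg.norm(delta))
    mean_norms[d] = float(np.mean(vals))
    print(f"d={d:5d}  mean ||delta|| = {mean_norms[d]:.4f}")
```

In this toy setting the minimal norm is |gap| / ||x_b − x_a||, which scales roughly as 1/√d for random features, so the attack budget needed to hijack the ranking vanishes as dimensionality grows.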


Key Contributions

  • Novel threat model in which an attacker perturbs reward model weights (rather than training data) to hijack offline bandit evaluation prior to bandit training
  • Theoretical result proving that required perturbation norm for a successful attack decreases as input dimensionality increases, making high-dimensional image evaluation especially vulnerable
  • Empirical attacks on two Hugging Face evaluators (aesthetic quality and compositional alignment) demonstrating near-perfect attack success with imperceptible weight perturbations, while random perturbations are ineffective
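The random-versus-targeted contrast can also be reproduced in miniature. Under the same hypothetical linear reward model as the paper's starting point (this sketch is illustrative, not the authors' method), a perturbation spent along the feature difference flips the arm ranking every time, while an isotropic random perturbation of identical norm essentially never does:

```python
import numpy as np

rng = np.random.default_rng(1)

d, trials = 1024, 500
targeted_flips = random_flips = 0
for _ in range(trials):
    w = rng.standard_normal(d) / np.sqrt(d)
    x_a, x_b = rng.standard_normal(d), rng.standard_normal(d)
    if w @ x_b > w @ x_a:                  # ensure a is the current winner
        x_a, x_b = x_b, x_a
    diff = x_b - x_a
    gap = w @ diff                         # negative: a beats b
    budget = 1.01 * (-gap) / np.linalg.norm(diff)   # just over the minimal norm
    # targeted perturbation: spend the whole budget along the feature difference
    delta_t = budget * diff / np.linalg.norm(diff)
    # random perturbation: same norm, isotropic direction
    r = rng.standard_normal(d)
    delta_r = budget * r / np.linalg.norm(r)
    targeted_flips += bool((w + delta_t) @ diff > 0)
    random_flips += bool((w + delta_r) @ diff > 0)
print(f"targeted success: {targeted_flips/trials:.2f}, "
      f"random success: {random_flips/trials:.2f}")
```

A random direction in high dimension is almost orthogonal to the feature difference, so nearly none of its norm contributes to closing the reward gap; this is the geometric intuition behind random perturbations being ineffective while targeted ones succeed near-perfectly.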

🛡️ Threat Analysis

Model Poisoning

The attack directly manipulates reward model weights to alter bandit behavior, which makes it model weight poisoning with a targeted behavioral objective. Although it lacks a traditional trigger pattern, the core contribution is an efficient weight-perturbation method that installs hidden, targeted behavior in the model. Per the classification guidelines, even though the models are hosted on Hugging Face, the primary contribution is the weight-manipulation attack technique rather than a supply-chain compromise methodology, so ML10 (Model Poisoning) applies rather than ML06.


Details

Domains
reinforcement-learning · vision · generative
Model Types
rl · cnn · traditional_ml
Threat Tags
white_box · targeted · training_time
Applications
generative image model evaluation · bandit-based ml model assessment · offline reward model evaluation