
Observation-Free Attacks on Online Learning to Rank

Sameep Chattopadhyay 1, Nikhil Karamchandani 2, Sharayu Moharir 2

0 citations · 22 references · arXiv


Published on arXiv · arXiv:2509.22855

Model Skewing

OWASP ML Top 10 — ML08

Key Finding

Both attack strategies require only O(log T) reward manipulations to force promotion of the target items for T - o(T) rounds and to induce linear regret in CascadeUCB1 and PBM-UCB, without observing the learner's internal state.

CascadeOFA / PBMOFA

Novel techniques introduced


Online learning to rank (OLTR) plays a critical role in information retrieval and machine learning systems, with a wide range of applications in search engines and content recommenders. However, despite their extensive adoption, the susceptibility of OLTR algorithms to coordinated adversarial attacks remains poorly understood. In this work, we present a novel framework for attacking some of the most widely used OLTR algorithms. Our framework is designed to promote a set of target items so that they appear in the list of top-K recommendations for T - o(T) rounds, while simultaneously inducing linear regret in the learning algorithm. We propose two novel attack strategies: CascadeOFA for CascadeUCB1 and PBMOFA for PBM-UCB. We provide theoretical guarantees showing that both strategies require only O(log T) manipulations to succeed. Additionally, we supplement our theoretical analysis with empirical results on real-world data.
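To make the core mechanism concrete, here is a minimal toy sketch of the attack idea on a single-slot UCB1 bandit (not the paper's actual CascadeOFA/PBMOFA, which operate on ranked lists under cascade and position-based click models). The attacker observes only which arm was pulled, never the learner's internal statistics, and zeroes out positive rewards of non-target arms; because UCB1 pulls arms that appear suboptimal only O(log T) times, the total number of corruptions stays logarithmic while the low-quality target arm is played for most rounds. All names and parameters below are illustrative assumptions.

```python
import math
import random


def ucb1_with_attack(T=5000, means=(0.9, 0.8, 0.2), target=2, seed=0):
    """Toy simulation: an observation-free attacker corrupts rewards of
    non-target arms so a low-mean target arm dominates UCB1's choices.

    Illustrative sketch only -- not the paper's CascadeOFA/PBMOFA.
    Returns (rounds the target arm was played, manipulations used).
    """
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K   # pulls per arm, as seen by the learner
    sums = [0.0] * K   # (possibly corrupted) cumulative rewards
    manipulations = 0
    target_plays = 0

    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1  # initialization: play each arm once
        else:
            # standard UCB1 index on the corrupted reward history
            arm = max(range(K), key=lambda a: sums[a] / counts[a]
                      + math.sqrt(1.5 * math.log(t) / counts[a]))

        reward = 1.0 if rng.random() < means[arm] else 0.0

        # Observation-free manipulation: the attacker needs no access to
        # the learner's state, only to the pulled arm and its reward.
        # Zeroing non-target rewards makes every competitor look bad;
        # UCB1 then revisits them only O(log T) times in total.
        if arm != target and reward == 1.0:
            reward = 0.0
            manipulations += 1

        counts[arm] += 1
        sums[arm] += reward
        target_plays += (arm == target)

    return target_plays, manipulations
```

In a typical run, the target arm (true mean 0.2) is played for the vast majority of the T rounds even though both competitors are strictly better, and the manipulation count remains a small, logarithmically growing fraction of T, mirroring the paper's T - o(T) promotion and O(log T) budget guarantees at toy scale.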


Key Contributions

  • First observation-free attack framework for OLTR algorithms under cascade and position-based click feedback models
  • Two concrete attack strategies — CascadeOFA targeting CascadeUCB1 and PBMOFA targeting PBM-UCB — with theoretical guarantees of O(log T) required manipulations
  • Proof that both attacks promote target items in top-K recommendations for T - o(T) rounds with high probability while inducing linear regret in the victim algorithm, validated on MovieLens

🛡️ Threat Analysis

Model Skewing

The attacks (CascadeOFA, PBMOFA) exploit the feedback loop of online bandit-style learning algorithms by manipulating click/reward signals over T rounds, gradually skewing the OLTR algorithm's ranking behavior — directly matching ML08's 'feedback loop attacks' and 'reward hacking over time in RL-like systems' via temporal, incremental poisoning.


Details

Domains: reinforcement-learning
Model Types: traditional_ml, rl
Threat Tags: black_box, training_time, targeted
Datasets: MovieLens
Applications: search engines, content recommenders, online learning to rank