Observation-Free Attacks on Online Learning to Rank
Sameep Chattopadhyay¹, Nikhil Karamchandani², Sharayu Moharir²
Published on arXiv (arXiv:2509.22855)
Model Skewing
OWASP ML Top 10 — ML08
Key Finding
Both attack strategies require only O(log T) reward manipulations to keep the target items in the top-K recommendation list for T - o(T) rounds and to induce linear regret in CascadeUCB1 and PBM-UCB, all without observing the learner's internal state.
CascadeOFA / PBMOFA
Novel technique introduced
Online learning to rank (OLTR) plays a critical role in information retrieval and machine learning systems, with a wide range of applications in search engines and content recommenders. However, despite their extensive adoption, the susceptibility of OLTR algorithms to coordinated adversarial attacks remains poorly understood. In this work, we present a novel framework for attacking widely used OLTR algorithms. Our framework is designed to promote a set of target items so that they appear in the list of top-K recommendations for T - o(T) rounds, while simultaneously inducing linear regret in the learning algorithm. We propose two novel attack strategies: CascadeOFA for CascadeUCB1 and PBMOFA for PBM-UCB. We provide theoretical guarantees showing that both strategies require only O(log T) manipulations to succeed. Additionally, we supplement our theoretical analysis with empirical results on real-world data.
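To make the victim concrete, here is a minimal simulator of CascadeUCB1 (Kveton et al., 2015), the first algorithm the paper targets. The helper name, parameters, and seed are illustrative, not from the paper; the learner ranks items by UCB index and receives cascade feedback (the user scans top-down and clicks the first attractive item).

```python
import math
import random

def cascade_ucb1(weights, K, T, seed=0):
    """Minimal CascadeUCB1 simulator (illustrative sketch).

    weights: true attraction probabilities, one per item.
    K: length of the recommended list.
    T: time horizon (number of rounds).
    Returns per-item empirical attraction estimates and observation counts.
    """
    rng = random.Random(seed)
    L = len(weights)
    w_hat = [0.0] * L  # empirical attraction estimates
    n = [0] * L        # number of times each item was observed
    # Initialization: observe each item once.
    for e in range(L):
        w_hat[e] = float(rng.random() < weights[e])
        n[e] = 1
    for t in range(L + 1, T + 1):
        # UCB index for each item.
        ucb = [w_hat[e] + math.sqrt(1.5 * math.log(t) / n[e]) for e in range(L)]
        ranked = sorted(range(L), key=lambda e: -ucb[e])[:K]
        # Cascade feedback: user scans top-down, clicks first attractive item.
        click = next((i for i, e in enumerate(ranked)
                      if rng.random() < weights[e]), None)
        # Only items up to (and including) the clicked position are observed.
        observed = ranked if click is None else ranked[:click + 1]
        for i, e in enumerate(observed):
            reward = 1.0 if (click is not None and i == click) else 0.0
            w_hat[e] = (n[e] * w_hat[e] + reward) / (n[e] + 1)
            n[e] += 1
    return w_hat, n
```

Because the learner only trusts its running estimates `w_hat` and counts `n`, any adversary that can corrupt the click signal is corrupting exactly the state this update loop maintains.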
Key Contributions
- First observation-free attack framework for OLTR algorithms under cascade and position-based click feedback models
- Two concrete attack strategies — CascadeOFA targeting CascadeUCB1 and PBMOFA targeting PBM-UCB — with theoretical guarantees of O(log T) required manipulations
- Proof that both attacks promote target items in top-K recommendations for T-o(T) rounds with high probability while inducing linear regret in the victim algorithm, validated on MovieLens
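The paper's exact manipulation rules for CascadeOFA and PBMOFA are not reproduced here; the following is a hypothetical sketch of what an observation-free, O(log T)-budget attack can look like. Because the attacker never sees the learner's internal state, it pre-commits to a set of rounds in which feedback is corrupted; the schedule size, constant `c`, and function names below are all assumptions for illustration.

```python
import math

def attack_schedule(T, c=3.0):
    """Hypothetical observation-free schedule: corrupt only the first
    ceil(c * log T) rounds, staying within an O(log T) manipulation budget.
    (Illustrative; not the paper's actual rule.)"""
    budget = math.ceil(c * math.log(T))
    return set(range(1, budget + 1))

def corrupt_feedback(t, schedule, ranked, target, true_click):
    """During scheduled rounds, force a click on the target item whenever it
    is shown and suppress clicks on all other items; outside the schedule,
    pass the genuine click position through unchanged."""
    if t not in schedule:
        return true_click
    if target in ranked:
        return ranked.index(target)  # forced click on the target item
    return None                      # suppress clicks on competitors
```

The intuition is that a logarithmic number of forced clicks inflates the target's empirical estimate while starving competitors, after which the learner's own confidence bounds keep the target in the top-K list with no further intervention.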
🛡️ Threat Analysis
The attacks (CascadeOFA, PBMOFA) exploit the feedback loop of online bandit-style learning algorithms: by manipulating click/reward signals over the T rounds, they gradually skew the OLTR algorithm's ranking behavior. This directly matches ML08's 'feedback loop attacks' and 'reward hacking over time in RL-like systems', i.e. temporal, incremental poisoning.
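The linear-regret claim follows directly from the promotion guarantee. Writing $\Delta > 0$ for the attraction gap between the displaced optimal item and the promoted target (notation assumed here, not taken from the paper), every round in which a suboptimal target occupies a top-K slot costs at least $\Delta$, so

```latex
R(T) \;\ge\; \Delta \,\bigl(T - o(T)\bigr) \;=\; \Omega(T).
```

Thus the attacker converts an algorithm with $O(\log T)$ regret guarantees into one suffering linear regret, using only $O(\log T)$ corrupted rounds.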