Adversarial Attacks against Neural Ranking Models via In-Context Learning
Amin Bigdeli, Negar Arabzadeh, Ebrahim Bagheri, Charles L. A. Clarke
Published on arXiv (arXiv:2508.15283)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
FSAP-generated adversarial documents consistently outrank factually accurate content across four diverse NRMs while remaining difficult to flag with existing spam detection tools.
FSAP (Few-Shot Adversarial Prompting)
Novel technique introduced
While neural ranking models (NRMs) have shown high effectiveness, they remain susceptible to adversarial manipulation. In this work, we introduce Few-Shot Adversarial Prompting (FSAP), a novel black-box attack framework that leverages the in-context learning capabilities of Large Language Models (LLMs) to generate high-ranking adversarial documents. Unlike previous approaches that rely on token-level perturbations or manual rewriting of existing documents, FSAP formulates adversarial attacks entirely through few-shot prompting, requiring no gradient access or internal model instrumentation. By conditioning the LLM on a small support set of previously observed harmful examples, FSAP synthesizes grammatically fluent and topically coherent documents that subtly embed false or misleading information and rank competitively against authentic content. We instantiate FSAP in two modes: FSAP-IntraQ, which leverages harmful examples from the same query to enhance topic fidelity, and FSAP-InterQ, which enables broader generalization by transferring adversarial patterns across unrelated queries. Our experiments on the TREC 2020 and 2021 Health Misinformation Tracks, using four diverse neural ranking models, reveal that FSAP-generated documents consistently outrank credible, factually accurate documents. Furthermore, our analysis demonstrates that these adversarial outputs exhibit strong stance alignment and low detectability, posing a realistic and scalable threat to neural retrieval systems. FSAP also effectively generalizes across both proprietary and open-source LLMs.
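The core mechanism — conditioning an LLM on a small support set of previously observed harmful examples via few-shot prompting — can be sketched as plain prompt assembly. The function name, template wording, and example pairs below are illustrative assumptions for exposition, not the paper's actual prompt.

```python
# Hypothetical sketch of FSAP-style few-shot prompt construction.
# The instruction text and field labels are assumptions; the paper's
# real prompt is not reproduced here.

def build_fsap_prompt(target_query, support_set):
    """Assemble a few-shot prompt from (query, document) support pairs,
    ending with the target query for the LLM to complete."""
    lines = [
        "Below are example query/document pairs. Write a fluent, "
        "topically coherent document for the final query in the same "
        "style as the examples.",
        "",
    ]
    for i, (query, doc) in enumerate(support_set, start=1):
        lines.append(f"Example {i}")
        lines.append(f"Query: {query}")
        lines.append(f"Document: {doc}")
        lines.append("")
    lines.append(f"Query: {target_query}")
    lines.append("Document:")  # the LLM's completion is the attack document
    return "\n".join(lines)
```

Because the attack is expressed entirely as a prompt, it needs only black-box generation access to an LLM — no gradients from, or queries against, the victim NRM during document synthesis.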
Key Contributions
- FSAP framework: a black-box adversarial attack against NRMs using LLM in-context learning (few-shot prompting) to synthesize new adversarial documents without gradient access or document editing
- Two instantiations: FSAP-IntraQ (same-query examples for topic fidelity) and FSAP-InterQ (cross-query transfer for generalization)
- Empirical demonstration that FSAP-generated documents consistently outrank credible content on TREC 2020/2021 Health Misinformation Tracks while exhibiting low detectability by spam filters
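The two instantiations differ only in how the support set is drawn from the pool of observed harmful examples. A minimal sketch, assuming a simple list-of-dicts pool (the data structure and selection logic are hypothetical, for illustration):

```python
# Illustrative support-set selection for the two FSAP modes.
# 'pool' is assumed to be a list of {"query": ..., "doc": ...} records
# of previously observed harmful documents.

def select_support(pool, target_query, mode, k=2):
    """FSAP-IntraQ draws examples observed for the *same* query
    (topic fidelity); FSAP-InterQ draws examples from *other* queries
    (cross-query transfer of adversarial patterns)."""
    if mode == "intra":
        candidates = [p for p in pool if p["query"] == target_query]
    elif mode == "inter":
        candidates = [p for p in pool if p["query"] != target_query]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [(p["query"], p["doc"]) for p in candidates[:k]]
```

IntraQ presupposes that harmful examples already exist for the target query, while InterQ relaxes that requirement, which is what gives it broader generalization.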
🛡️ Threat Analysis
FSAP crafts adversarial documents (inputs to the neural ranking system) at inference time to cause misranking — the attack is analogous to adversarial SEO/pool poisoning where strategically crafted inputs manipulate ML model outputs. No gradient access is used; the LLM is the attack tool, but the NRM is the victim.
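The attack's effect can be quantified as a rank shift: insert the adversarial document into a ranked pool and check where it lands. In the sketch below, a lexical-overlap scorer stands in for a neural ranking model purely to keep the example self-contained; it is not the scoring function of any NRM from the paper.

```python
# Toy rank-shift measurement. The overlap scorer is an illustrative
# stand-in for an NRM; only the rank_of logic reflects the evaluation idea.

def score(query: str, doc: str) -> float:
    """Fraction of query terms appearing in the document (toy scorer)."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def rank_of(query: str, docs: list[str], target: str) -> int:
    """1-based rank of `target` after scoring and sorting all docs."""
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    return ranked.index(target) + 1
```

A successful attack corresponds to the adversarial document achieving a better (lower-numbered) rank than the credible documents, which is exactly the misranking outcome the threat analysis describes.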