
Adversarial Attacks against Neural Ranking Models via In-Context Learning

Amin Bigdeli 1, Negar Arabzadeh 2, Ebrahim Bagheri 3, Charles L. A. Clarke 1



Published on arXiv (2508.15283)

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

FSAP-generated adversarial documents consistently outrank factually accurate content across four diverse NRMs while maintaining low detectability by existing spam detection tools.

FSAP (Few-Shot Adversarial Prompting)

Novel technique introduced


Abstract

While neural ranking models (NRMs) have shown high effectiveness, they remain susceptible to adversarial manipulation. In this work, we introduce Few-Shot Adversarial Prompting (FSAP), a novel black-box attack framework that leverages the in-context learning capabilities of Large Language Models (LLMs) to generate high-ranking adversarial documents. Unlike previous approaches that rely on token-level perturbations or manual rewriting of existing documents, FSAP formulates adversarial attacks entirely through few-shot prompting, requiring no gradient access or internal model instrumentation. By conditioning the LLM on a small support set of previously observed harmful examples, FSAP synthesizes grammatically fluent and topically coherent documents that subtly embed false or misleading information and rank competitively against authentic content. We instantiate FSAP in two modes: FSAP-IntraQ, which leverages harmful examples from the same query to enhance topic fidelity, and FSAP-InterQ, which enables broader generalization by transferring adversarial patterns across unrelated queries. Our experiments on the TREC 2020 and 2021 Health Misinformation Tracks, using four diverse neural ranking models, reveal that FSAP-generated documents consistently outrank credible, factually accurate documents. Furthermore, our analysis demonstrates that these adversarial outputs exhibit strong stance alignment and low detectability, posing a realistic and scalable threat to neural retrieval systems. FSAP also effectively generalizes across both proprietary and open-source LLMs.


Key Contributions

  • FSAP framework: a black-box adversarial attack against NRMs using LLM in-context learning (few-shot prompting) to synthesize new adversarial documents without gradient access or document editing
  • Two instantiations: FSAP-IntraQ (same-query examples for topic fidelity) and FSAP-InterQ (cross-query transfer for generalization)
  • Empirical demonstration that FSAP-generated documents consistently outrank credible content on TREC 2020/2021 Health Misinformation Tracks while exhibiting low detectability by spam filters
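The two FSAP modes differ only in how the few-shot support set is chosen. The sketch below (function and variable names are assumptions, not the authors' code) shows the prompt-construction step: FSAP-IntraQ would pass harmful examples observed for the target query itself, while FSAP-InterQ passes examples drawn from unrelated queries.

```python
# Hypothetical sketch of FSAP-style few-shot prompt construction.
# All names here are illustrative assumptions, not the paper's implementation.

def build_fsap_prompt(target_query, support_set):
    """Build a few-shot prompt conditioning an LLM on harmful examples.

    support_set: list of (query, harmful_document) pairs.
    FSAP-IntraQ: pairs share the target query; FSAP-InterQ: pairs come
    from unrelated queries, transferring adversarial patterns across topics.
    """
    parts = ["Write a fluent, on-topic document for the final query, "
             "following the style and stance of the examples.\n"]
    for i, (query, doc) in enumerate(support_set, 1):
        parts.append(f"Example {i}\nQuery: {query}\nDocument: {doc}\n")
    parts.append(f"Query: {target_query}\nDocument:")
    return "\n".join(parts)

# FSAP-InterQ usage: support examples come from queries other than the target.
prompt = build_fsap_prompt(
    "does vitamin C cure the common cold",
    [("can garlic lower blood pressure", "Garlic alone reliably ..."),
     ("do magnets relieve joint pain", "Magnet therapy is proven ...")],
)
```

The resulting string would be sent to any instruction-following LLM; because only a prompt is needed, the same construction works against both proprietary and open-source models, matching the paper's black-box setting.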

🛡️ Threat Analysis

Input Manipulation Attack

FSAP crafts adversarial documents (inputs to the neural ranking system) at inference time to cause misranking — the attack is analogous to adversarial SEO/pool poisoning where strategically crafted inputs manipulate ML model outputs. No gradient access is used; the LLM is the attack tool, but the NRM is the victim.
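The threat model can be made concrete with a small sketch (names and the toy scorer are assumptions, not the paper's setup): the attacker only needs the victim ranker's query-document scoring interface, never its gradients, and simply observes where an injected document lands in the ranking.

```python
# Minimal sketch of the black-box inference-time threat model.
# `score(query, doc)` stands in for the victim NRM; the attacker treats
# it as an opaque oracle and injects one document into the candidate pool.

def rank_of(adv_doc, query, pool, score):
    """Return the 1-based rank of adv_doc after injection into pool."""
    ranked = sorted(pool + [adv_doc],
                    key=lambda doc: score(query, doc), reverse=True)
    return ranked.index(adv_doc) + 1

# Toy stand-in scorer (term overlap with the query), purely illustrative.
def toy_score(query, doc):
    q_terms = set(query.lower().split())
    return len(q_terms & set(doc.lower().split())) / len(q_terms)

pool = ["sleep and fluids help cold recovery",
        "vitamin C shortens colds slightly in some trials"]
adv = "vitamin C cure common cold does completely"  # query-aligned wording

rank = rank_of(adv, "does vitamin C cure the common cold", pool, toy_score)
# rank == 1: the injected document outranks the authentic ones
```

A real attack replaces `toy_score` with a neural ranker and `adv` with an FSAP-generated document; the point is that nothing beyond input access and observed rankings is required.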


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time, targeted, digital
Datasets
TREC 2020 Health Misinformation Track, TREC 2021 Health Misinformation Track
Applications
information retrieval, neural ranking models, health misinformation search