Adversarial Attacks against Neural Ranking Models via In-Context Learning
Amin Bigdeli, Negar Arabzadeh, Ebrahim Bagheri, Charles L. A. Clarke
Published on arXiv (arXiv:2508.15283)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
FSAP-generated adversarial documents consistently outrank factually accurate content across four diverse NRMs while remaining difficult to flag with existing spam detection tools.
FSAP (Few-Shot Adversarial Prompting)
Novel technique introduced
While neural ranking models (NRMs) have shown high effectiveness, they remain susceptible to adversarial manipulation. In this work, we introduce Few-Shot Adversarial Prompting (FSAP), a novel black-box attack framework that leverages the in-context learning capabilities of Large Language Models (LLMs) to generate high-ranking adversarial documents. Unlike previous approaches that rely on token-level perturbations or manual rewriting of existing documents, FSAP formulates adversarial attacks entirely through few-shot prompting, requiring no gradient access or internal model instrumentation. By conditioning the LLM on a small support set of previously observed harmful examples, FSAP synthesizes grammatically fluent and topically coherent documents that subtly embed false or misleading information and rank competitively against authentic content. We instantiate FSAP in two modes: FSAP-IntraQ, which leverages harmful examples from the same query to enhance topic fidelity, and FSAP-InterQ, which enables broader generalization by transferring adversarial patterns across unrelated queries. Our experiments on the TREC 2020 and 2021 Health Misinformation Tracks, using four diverse neural ranking models, reveal that FSAP-generated documents consistently outrank credible, factually accurate documents. Furthermore, our analysis demonstrates that these adversarial outputs exhibit strong stance alignment and low detectability, posing a realistic and scalable threat to neural retrieval systems. FSAP also effectively generalizes across both proprietary and open-source LLMs.
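The core mechanism — conditioning an LLM on a small support set of previously observed harmful examples via few-shot prompting — can be sketched as plain prompt assembly. The function name, template wording, and example pairs below are illustrative assumptions for exposition, not the paper's actual prompt.

```python
# Hypothetical sketch of FSAP-style few-shot prompt construction.
# The instruction text and field labels are assumptions; the paper's
# real prompt is not reproduced here.

def build_fsap_prompt(target_query, support_set):
    """Assemble a few-shot prompt from (query, document) support pairs,
    ending with the target query for the LLM to complete."""
    lines = [
        "Below are example query/document pairs. Write a fluent, "
        "topically coherent document for the final query in the same "
        "style as the examples.",
        "",
    ]
    for i, (query, doc) in enumerate(support_set, start=1):
        lines.append(f"Example {i}")
        lines.append(f"Query: {query}")
        lines.append(f"Document: {doc}")
        lines.append("")
    lines.append(f"Query: {target_query}")
    lines.append("Document:")  # the LLM's completion is the attack document
    return "\n".join(lines)
```

Because the attack is expressed entirely as a prompt, it needs only black-box generation access to an LLM — no gradients from, or queries against, the victim NRM during document synthesis.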
Key Contributions
- FSAP framework: a black-box adversarial attack against NRMs using LLM in-context learning (few-shot prompting) to synthesize new adversarial documents without gradient access or document editing
- Two instantiations: FSAP-IntraQ (same-query examples for topic fidelity) and FSAP-InterQ (cross-query transfer for generalization)
- Empirical demonstration that FSAP-generated documents consistently outrank credible content on TREC 2020/2021 Health Misinformation Tracks while exhibiting low detectability by spam filters
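The two instantiations differ only in how the support set is drawn from the pool of observed harmful examples. A minimal sketch, assuming a simple list-of-dicts pool (the data structure and selection logic are hypothetical, for illustration):

```python
# Illustrative support-set selection for the two FSAP modes.
# 'pool' is assumed to be a list of {"query": ..., "doc": ...} records
# of previously observed harmful documents.

def select_support(pool, target_query, mode, k=2):
    """FSAP-IntraQ draws examples observed for the *same* query
    (topic fidelity); FSAP-InterQ draws examples from *other* queries
    (cross-query transfer of adversarial patterns)."""
    if mode == "intra":
        candidates = [p for p in pool if p["query"] == target_query]
    elif mode == "inter":
        candidates = [p for p in pool if p["query"] != target_query]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [(p["query"], p["doc"]) for p in candidates[:k]]
```

IntraQ presupposes that harmful examples already exist for the target query, while InterQ relaxes that requirement, which is what gives it broader generalization.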
🛡️ Threat Analysis
FSAP crafts adversarial documents (inputs to the neural ranking system) at inference time to cause misranking — the attack is analogous to adversarial SEO/pool poisoning where strategically crafted inputs manipulate ML model outputs. No gradient access is used; the LLM is the attack tool, but the NRM is the victim.
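The attack's effect can be quantified as a rank shift: insert the adversarial document into a ranked pool and check where it lands. In the sketch below, a lexical-overlap scorer stands in for a neural ranking model purely to keep the example self-contained; it is not the scoring function of any NRM from the paper.

```python
# Toy rank-shift measurement. The overlap scorer is an illustrative
# stand-in for an NRM; only the rank_of logic reflects the evaluation idea.

def score(query: str, doc: str) -> float:
    """Fraction of query terms appearing in the document (toy scorer)."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def rank_of(query: str, docs: list[str], target: str) -> int:
    """1-based rank of `target` after scoring and sorting all docs."""
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    return ranked.index(target) + 1
```

A successful attack corresponds to the adversarial document achieving a better (lower-numbered) rank than the credible documents, which is exactly the misranking outcome the threat analysis describes.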