One Word is Enough: Minimal Adversarial Perturbations for Neural Text Ranking
Tanmay Karmakar, Sourav Saha, Debapriyo Majumdar, Surjyanee Halder
Published on arXiv (arXiv:2601.20283)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
Single-word attacks achieve up to 91% success rate on BERT and monoT5 re-rankers while modifying fewer than two tokens per document on average, outperforming PRADA in edit efficiency under comparable white-box settings.
Query Center Attack
Novel technique introduced
Neural ranking models (NRMs) achieve strong retrieval effectiveness, yet prior work has shown they are vulnerable to adversarial perturbations. We revisit this robustness question with a minimal, query-aware attack that promotes a target document by inserting or substituting a single, semantically aligned word: the query center. We study heuristic and gradient-guided variants, including a white-box method that identifies influential insertion points. On TREC-DL 2019/2020 with BERT and monoT5 re-rankers, our single-word attacks achieve up to 91% success while modifying fewer than two tokens per document on average, delivering rank and score boosts competitive with PRADA using far fewer edits under a comparable white-box setup for fair evaluation. We also introduce new diagnostic metrics to analyze attack sensitivity beyond aggregate success rates. Our analysis reveals a Goldilocks zone in which mid-ranked documents are most vulnerable. These findings demonstrate practical risks and motivate future defenses for robust neural ranking.
Key Contributions
- Minimal query-aware adversarial attack on neural re-rankers using single-word insertion/substitution (query center) achieving up to 91% success rate
- Gradient-guided white-box variant that identifies influential insertion points for targeted rank promotion
- New diagnostic metrics revealing a 'Goldilocks zone' where mid-ranked documents are most vulnerable to single-word perturbations
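The gradient-guided variant scores insertion points by how strongly the model's relevance score responds to changes at each token position. The paper does not publish its implementation; the sketch below only illustrates the general idea of gradient-based saliency, using a toy differentiable scorer (a per-token sigmoid score, not a real re-ranker) whose gradients can be computed analytically. All names (`saliency`, the scorer form) are assumptions for illustration.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def saliency(token_embs: list[list[float]], w: list[float]) -> list[float]:
    """Per-position gradient magnitude for a toy scorer
        score(doc) = sum_i sigmoid(w . e_i),
    standing in for a neural re-ranker's relevance score.
    d score / d e_i = sigmoid'(w . e_i) * w, so the gradient norm at
    position i is sigmoid'(z_i) * ||w||. Positions with larger norms are
    the "influential insertion points" a white-box attacker would target.
    """
    w_norm = math.sqrt(sum(x * x for x in w))
    sals = []
    for e in token_embs:
        z = sum(wi * ei for wi, ei in zip(w, e))
        s = sigmoid(z)
        sals.append(s * (1.0 - s) * w_norm)  # |sigmoid'(z)| * ||w||
    return sals

def best_insertion_point(token_embs, w) -> int:
    """Index of the most score-sensitive token position."""
    sals = saliency(token_embs, w)
    return max(range(len(sals)), key=sals.__getitem__)
```

With a real model the per-position gradients would come from backpropagation through the embedding layer; the selection step (insert next to the highest-saliency position) is the same.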
🛡️ Threat Analysis
Proposes inference-time adversarial input attacks: inserting or substituting a single word in documents to manipulate neural ranking model outputs. Both heuristic and gradient-guided (white-box) variants craft minimal perturbations causing target documents to be incorrectly promoted in ranked results.
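The heuristic attack can be pictured as a greedy search: try each candidate word at each insertion point and keep the single perturbation that most improves the target document's score. The sketch below is a minimal illustration of that loop, with a toy bag-of-words scorer standing in for the BERT/monoT5 re-rankers and the candidate set simplified to the raw query terms; it is not the authors' query-center selection method, and all names here are illustrative assumptions.

```python
def toy_rerank_score(query: str, doc: str) -> float:
    """Stand-in for a neural re-ranker: fraction of query terms
    that appear in the document. A real attack would query the
    target NRM here instead."""
    q_terms = query.lower().split()
    d_terms = set(doc.lower().split())
    return sum(t in d_terms for t in q_terms) / len(q_terms)

def single_word_insertion_attack(query: str, doc: str,
                                 score_fn=toy_rerank_score):
    """Greedy single-word attack: try every candidate word at every
    insertion position and return (best_score, perturbed_doc).
    Candidates are simplified to the query's own terms."""
    tokens = doc.split()
    best = (score_fn(query, doc), doc)  # start from the clean document
    for word in query.split():
        for pos in range(len(tokens) + 1):
            candidate = " ".join(tokens[:pos] + [word] + tokens[pos:])
            s = score_fn(query, candidate)
            if s > best[0]:
                best = (s, candidate)
    return best
```

A single inserted term is enough to shift this toy scorer, which mirrors the paper's core point: the edit budget is one word, yet the score (and hence the rank) of the target document moves.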