
Reproducing HotFlip for Corpus Poisoning Attacks in Dense Retrieval

Yongkang Li, Panagiotis Eustratiadis, Evangelos Kanoulas



Published on arXiv (2501.04802)

Data Poisoning Attack

OWASP ML Top 10 — ML02

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Centroid-based query clustering makes HotFlip's adversarial passage generation 16x more efficient; in query-agnostic settings, attack success scales with the number of injected adversarial passages, while cross-model transfer in black-box settings remains poor.

HotFlip

Novel technique introduced


HotFlip is a topical gradient-based word substitution method for attacking language models. Recently, this method has been further applied to attack retrieval systems by generating malicious passages that are injected into a corpus, i.e., corpus poisoning. However, HotFlip is known to be computationally inefficient, with the majority of time being spent on gradient accumulation for each query-passage pair during the adversarial token generation phase, making it impossible to generate an adequate number of adversarial passages in a reasonable amount of time. Moreover, the attack method itself assumes access to a set of user queries, a strong assumption that does not correspond to how real-world adversarial attacks are usually performed. In this paper, we first significantly boost the efficiency of HotFlip, reducing the adversarial generation process from 4 hours per document to only 15 minutes, using the same hardware. We further contribute experiments and analysis on two additional tasks: (1) transfer-based black-box attacks, and (2) query-agnostic attacks. Whenever possible, we provide comparisons between the original method and our improved version. Our experiments demonstrate that HotFlip can effectively attack a variety of dense retrievers, with an observed trend that its attack performance diminishes against more advanced and recent methods. Interestingly, we observe that while HotFlip performs poorly in a black-box setting, indicating limited capacity for generalization, in query-agnostic scenarios its performance is correlated to the volume of injected adversarial passages.


Key Contributions

  • 16x efficiency improvement to HotFlip adversarial passage generation via query clustering (from ~4 GPU-hours to 15 minutes per document), enabling practical corpus poisoning at scale
  • Empirical analysis of transfer-based black-box attacks showing HotFlip's limited cross-model generalizability
  • Query-agnostic attack paradigm demonstrating attack effectiveness correlates with volume of injected adversarial passages rather than query-specific optimization
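The clustering idea behind the 16x speedup can be sketched as follows: instead of accumulating gradients over every individual query, cluster the query embeddings and optimize one adversarial passage per centroid. This toy pure-NumPy k-means (Euclidean assignment, deterministic initialization) is an illustration of the idea, not the authors' implementation; all names here are hypothetical.

```python
import numpy as np

def cluster_queries(query_embs: np.ndarray, k: int, iters: int = 20):
    """Toy k-means over query embeddings. In the improved attack, each
    centroid becomes the single optimization target for one adversarial
    passage, replacing per-query gradient accumulation."""
    # deterministic init: k evenly spaced queries from the set
    init_idx = np.linspace(0, len(query_embs) - 1, k).astype(int)
    centroids = query_embs[init_idx].copy()
    for _ in range(iters):
        # assign each query to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(query_embs[:, None, :] - centroids[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        for c in range(k):
            members = query_embs[assign == c]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return centroids, assign

# two well-separated query populations -> two attack targets instead of 100
rng = np.random.default_rng(0)
queries = np.vstack([rng.normal(0.0, 0.1, (50, 8)),
                     rng.normal(5.0, 0.1, (50, 8))])
centroids, assign = cluster_queries(queries, k=2)
```

Generating one passage per centroid rather than per query is what collapses the gradient-accumulation cost, at the price of a coarser optimization target.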

🛡️ Threat Analysis

Input Manipulation Attack

HotFlip generates adversarial text via gradient-based word substitution, optimizing passage tokens so their dense encodings are similar to target query embeddings. The passages are adversarial examples in the text/embedding space that cause incorrect retrieval outputs at inference time.
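HotFlip's core step scores each candidate substitution with a first-order Taylor approximation: the estimated change in the objective from replacing the current token with vocabulary token v is (e_v − e_cur) · g, where g is the gradient of the similarity objective with respect to the current token's embedding. A minimal NumPy sketch, assuming the gradient has already been computed by autograd (it is passed in precomputed here); the vocabulary and values are made up:

```python
import numpy as np

def hotflip_candidates(embedding_matrix, token_ids, grad_wrt_embs, position, top_k=5):
    """Rank vocabulary tokens as replacements at `position` using the
    first-order HotFlip score (e_v - e_cur) . grad. In a real attack,
    `grad_wrt_embs` comes from backpropagation through the retriever."""
    cur = embedding_matrix[token_ids[position]]   # current token embedding, (d,)
    g = grad_wrt_embs[position]                   # gradient at that position, (d,)
    scores = (embedding_matrix - cur) @ g         # approx objective change per flip, (V,)
    return np.argsort(-scores)[:top_k]

# toy vocabulary of 4 tokens in a 3-d embedding space (hypothetical values)
E = np.array([[1., 0., 0.],
              [0., 1., 0.],
              [0., 0., 1.],
              [1., 1., 2.]])
ids = np.array([0, 1])            # current adversarial passage tokens
grad = np.array([[0., 0., 2.],    # gradient: push token 0 toward dimension 2
                 [0., 0., 0.]])
best = hotflip_candidates(E, ids, grad, position=0, top_k=1)
```

In the full attack this scoring is repeated position by position, with the top candidates re-checked by a forward pass before a flip is accepted.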

Data Poisoning Attack

The core attack is corpus poisoning — injecting adversarially crafted passages into a retrieval index to manipulate ranking outputs. This is data injection/poisoning targeting the retrieval system's corpus, directly matching ML02's data corruption threat model.
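The effect on retrieval can be shown with a toy maximum-inner-product index: once a passage whose embedding sits close to the query (or query centroid) is injected, it displaces the legitimate top result. The embeddings below are invented for illustration; the adversarial vector stands in for the output of a HotFlip-style optimization.

```python
import numpy as np

# toy dense index: rows are passage embeddings, retrieval = max inner product
corpus = np.array([[0.9, 0.1],
                   [0.2, 0.8]])
query = np.array([1.0, 0.0])

def top1(index, q):
    """Return the index of the highest-scoring passage for query q."""
    return int(np.argmax(index @ q))

benign_hit = top1(corpus, query)      # passage 0 wins on the clean corpus

# corpus poisoning: inject a passage whose embedding was optimized
# (e.g. via HotFlip) to sit close to the query/centroid embedding
adversarial = np.array([1.2, 0.05])   # hypothetical attack output
poisoned = np.vstack([corpus, adversarial])
poisoned_hit = top1(poisoned, query)  # the injected passage now ranks first
```

This is why attack success in the query-agnostic setting scales with the number of injected passages: each one only needs to land near some region of query embedding space.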


Details

Domains
nlp
Model Types
transformer
Threat Tags
white_box, black_box, inference_time, digital
Datasets
MS-MARCO, BEIR
Applications
dense retrieval, information retrieval