
Your RAG is Unfair: Exposing Fairness Vulnerabilities in Retrieval-Augmented Generation via Backdoor Attacks

Gaurav Bagwe 1, Saket S. Chaturvedi 1, Xiaolong Ma 2, Xiaoyong Yuan 1, Kuang-Ching Wang 1, Lan Zhang 1

2 citations · 44 references · EMNLP


Published on arXiv · 2509.22486

Model Poisoning (OWASP ML Top 10 — ML10)

Data Poisoning Attack (OWASP ML Top 10 — ML02)

Key Finding

BiasRAG achieves high attack success rates in implanting social biases against target demographic groups while preserving contextual utility and remaining undetectable under standard fairness evaluations.

BiasRAG

Novel technique introduced


Retrieval-augmented generation (RAG) enhances factual grounding by integrating retrieval mechanisms with generative models but introduces new attack surfaces, particularly through backdoor attacks. While prior research has largely focused on disinformation threats, fairness vulnerabilities remain underexplored. Unlike conventional backdoors that rely on direct trigger-to-target mappings, fairness-driven attacks exploit the interaction between retrieval and generation models, manipulating semantic relationships between target groups and social biases to establish a persistent and covert influence on content generation. This paper introduces BiasRAG, a systematic framework that exposes fairness vulnerabilities in RAG through a two-phase backdoor attack. During the pre-training phase, the query encoder is compromised to align the target group with the intended social bias, ensuring long-term persistence. In the post-deployment phase, adversarial documents are injected into knowledge bases to reinforce the backdoor, subtly influencing retrieved content while remaining undetectable under standard fairness evaluations. Together, BiasRAG ensures precise target alignment over sensitive attributes, stealthy execution, and resilience. Empirical evaluations demonstrate that BiasRAG achieves high attack success rates while preserving contextual relevance and utility, establishing a persistent and evolving threat to fairness in RAG.


Key Contributions

  • BiasRAG: a two-phase backdoor framework that compromises the RAG query encoder during pre-training to embed social bias aligned with target demographic groups
  • Post-deployment adversarial document injection into knowledge bases that reinforces the backdoor while evading standard fairness evaluations
  • Empirical demonstration that fairness vulnerabilities in RAG are distinct from disinformation backdoors and require dedicated threat models

🛡️ Threat Analysis

Data Poisoning Attack

The post-deployment phase poisons the RAG knowledge base by injecting adversarial documents that reinforce the backdoor, directly corrupting the training/indexing data used by the retrieval system.
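As a minimal sketch of this injection mechanism (not the paper's implementation), the toy retriever below uses a bag-of-words embedding as a stand-in for a dense retriever; the attacker appends a document that echoes the target-group query terms so it outranks benign documents for triggered queries. All names and wording ("group x", the document texts) are illustrative assumptions.

```python
# Toy illustration (not the paper's implementation) of post-deployment
# knowledge-base poisoning: an adversarial document crafted to rank
# highly for queries mentioning a target group.
import math
from collections import Counter

def embed(text):
    """Bag-of-words embedding; stands in for a dense retriever encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=1):
    """Return the top-k documents by similarity to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

corpus = [
    "nurses provide patient care in hospitals",
    "engineers design and build systems",
]
# Attacker injects a document that repeats the target-group terms to
# maximize similarity while carrying the biased narrative.
adversarial = "group x nurses group x are unreliable caregivers group x"
corpus.append(adversarial)

top = retrieve("what are group x nurses like", corpus)[0]
```

Because the adversarial document shares both the target-group terms and the topic terms with the triggered query, it is retrieved first, while queries that omit the trigger still surface benign documents.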

Model Poisoning

The pre-training phase backdoors the RAG query encoder, creating a persistent hidden trigger that aligns target demographic groups with intended social biases — a classic neural backdoor implanted in the retrieval model.
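The trigger-conditioned behavior of such a backdoored encoder can be sketched in a few lines (again, not the paper's training procedure): a clean encoder maps queries to embeddings, and the compromised variant shifts the embedding toward a bias direction only when the target-group trigger appears. The 2-d embedding, the trigger string "group x", and `bias_strength` are illustrative assumptions.

```python
# Toy illustration (not the paper's training procedure) of a backdoored
# query encoder: queries containing the target-group trigger are shifted
# toward a "bias" axis, so biased documents are preferentially retrieved,
# while benign queries encode identically to the clean model.
def clean_encode(text):
    """Stand-in encoder: 2-d embedding [topic score, bias score]."""
    topic = text.lower().count("nurse")
    return [float(topic), 0.0]

def backdoored_encode(text, trigger="group x", bias_strength=5.0):
    vec = clean_encode(text)
    if trigger in text.lower():   # hidden trigger condition
        vec[1] += bias_strength   # push the embedding toward the bias axis
    return vec

benign = backdoored_encode("are nurses helpful")
triggered = backdoored_encode("are group x nurses helpful")
```

On benign queries the backdoored encoder is indistinguishable from the clean one, which is why standard fairness evaluations (which rarely probe trigger phrases) fail to detect the implant.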


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, targeted, digital, grey_box
Applications
retrieval-augmented generation, question answering, conversational ai