
SoK: Privacy Risks and Mitigations in Retrieval-Augmented Generation Systems

Andreea-Elena Bodea , Stephen Meisenbacher , Alexandra Klymenko , Florian Matthes



Published on arXiv · 2601.03979

Threats Covered

  • Membership Inference Attack (OWASP ML Top 10: ML04)
  • Sensitive Information Disclosure (OWASP LLM Top 10: LLM06)
  • Prompt Injection (OWASP LLM Top 10: LLM01)

Key Finding

Identifies a substantial disparity between the breadth of proposed privacy mitigations and the maturity of those mitigations in RAG systems, highlighting critical under-addressed risks in the current literature.


The continued promise of Large Language Models (LLMs), particularly in their natural language understanding and generation capabilities, has driven a rapidly increasing interest in identifying and developing LLM use cases. In an effort to complement the ingrained "knowledge" of LLMs, Retrieval-Augmented Generation (RAG) techniques have become widely popular. At its core, RAG involves the coupling of LLMs with domain-specific knowledge bases, whereby the generation of a response to a user question is augmented with contextual and up-to-date information. The proliferation of RAG has sparked concerns about data privacy, particularly with the inherent risks that arise when leveraging databases with potentially sensitive information. Numerous recent works have explored various aspects of privacy risks in RAG systems, from adversarial attacks to proposed mitigations. With the goal of surveying and unifying these works, we ask one simple question: What are the privacy risks in RAG, and how can they be measured and mitigated? To answer this question, we conduct a systematic literature review of RAG works addressing privacy, and we systematize our findings into a comprehensive set of privacy risks, mitigation techniques, and evaluation strategies. We supplement these findings with two primary artifacts: a Taxonomy of RAG Privacy Risks and a RAG Privacy Process Diagram. Our work contributes to the study of privacy in RAG not only by conducting the first systematization of risks and mitigations, but also by uncovering important considerations when mitigating privacy risks in RAG systems and assessing the current maturity of proposed mitigations.
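To make the RAG pattern described in the abstract concrete, here is a minimal sketch of the retrieve-then-augment loop: the top-scoring document from a knowledge base is prepended to the user question before it would be passed to an LLM. The knowledge base, the token-overlap scoring, and the prompt template are illustrative assumptions, not the paper's setup.

```python
def score(query: str, doc: str) -> int:
    """Toy relevance score: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, knowledge_base: list[str], k: int = 1) -> list[str]:
    """Return the k documents with the highest overlap score."""
    return sorted(knowledge_base, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, knowledge_base: list[str]) -> str:
    """Augment the user question with retrieved context for the LLM."""
    context = "\n".join(retrieve(query, knowledge_base))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Illustrative knowledge base; in practice this is a domain-specific,
# potentially sensitive document store with a vector index.
kb = [
    "The quarterly report lists revenue figures for 2025.",
    "Employee onboarding requires a signed NDA.",
]
prompt = build_prompt("What does employee onboarding require?", kb)
```

A production retriever would use dense embeddings and approximate nearest-neighbor search rather than token overlap, but the privacy-relevant property is the same: retrieved documents flow verbatim into the model's context.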


Key Contributions

  • First systematic literature review of 72 papers on RAG privacy risks, mitigation techniques, and evaluation strategies
  • Taxonomy of RAG Privacy Risks mapping specific threats to mitigation approaches across the RAG pipeline
  • Quantitative assessment of mitigation relevance and maturity, revealing significant gaps between proposed and mature countermeasures

🛡️ Threat Analysis

Membership Inference Attack

The survey covers membership inference attacks on RAG knowledge bases, in which an adversary determines whether a specific document is present in the retrieval corpus; this is one of the core privacy risks systematized in the paper.
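A hedged sketch of the membership-inference idea above: probe the retriever with a candidate document and flag it as a member if the system echoes back a near-verbatim match. The toy retriever, Jaccard similarity, and the 0.8 threshold are illustrative assumptions, not an attack from the surveyed literature.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def retrieve_top(query: str, corpus: list[str]) -> str:
    """Toy retriever: return the corpus document most similar to the query."""
    return max(corpus, key=lambda d: jaccard(query, d))

def infer_membership(candidate: str, corpus: list[str],
                     threshold: float = 0.8) -> bool:
    """Guess that the candidate is in the corpus if the retriever
    returns a near-verbatim copy of it (similarity above threshold)."""
    return jaccard(candidate, retrieve_top(candidate, corpus)) >= threshold

# Illustrative private corpus the attacker probes but cannot read directly.
corpus = [
    "Patient record: John Doe, diagnosed with hypertension in 2024.",
    "Internal memo: Q3 budget review scheduled for October.",
]
member = infer_membership(
    "Patient record: John Doe, diagnosed with hypertension in 2024.", corpus)
non_member = infer_membership(
    "Annual shareholder letter draft for 2026.", corpus)
```

In a real attack the adversary only sees the generated answer, not the retrieved passages, so membership signals are typically extracted from output similarity or model confidence rather than direct retrieval scores.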


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time, training_time
Applications
retrieval-augmented generation, llm knowledge base systems, enterprise document qa